IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
Journal of Research
and Development
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Patents  
  ·  Recent publications  
  ·  Author's Guide  
  Staff  
  Contact Us  
  Related link:  
     IBM Life Sciences  
IBM Journal of Research and Development  
Volume 45, Numbers 3/4, 2001
Deep computing for the life sciences
 Table of contents: arrowHTML arrowPDF arrowASCII   This article: HTML arrowPDF arrowASCII   DOI: 10.1147/rd.453.0533 arrowCopyright info
   

QSAR in grossly underdetermined systems: Opportunities and issues

by D. E. Platt, L. Parida, Y. Gao, A. Floratos, and I. Rigoutsos
Regression in grossly underdetermined systems has emerged as an important means for understanding molecular activity via comparative molecular field analysis (CoMFA) and other quantitative structure activity relationship (QSAR) studies. But this methodology has applications in much broader areas; for example, near-infrared spectroscopy, mutational enzyme activity studies including protein folding rates to determine which sites are important for determining conformation, and analyses of gene expression data from chip arrays. An error analysis which answers questions concerning the quality of the predictivity, the relative importance of each descriptor, the quality of the estimates of the contribution by each descriptor, and the number of independent components expressed by the associated data is indispensable in understanding whether some particular set of structure variables is important in defining the mechanisms driving the chemical or biological activities. This paper reviews opportunities for QSAR studies. It also considers the analytical aspects of error analysis in least-squares regression, and contrasts principal component regression (PCR) and partial least-squares (PLS) procedures with cross-validation on the issues of error analysis (e.g., the quality of the contribution estimates for each structure descriptor). Further, a methodology for selecting optimal subsets of components in PCR is presented.

1. Introduction

In chemistry, linear regression [1] has provided means for identifying which structural features may be important in determining chemical activity over a group of reactants in a poorly understood interaction. This is achieved by forming a linear relationship between those variables that describe the structural variation within the group of reactants and those that describe the activities of the reactants. The relationship is denoted as a quantitative structure activity relationship (QSAR). QSAR studies have therefore formed a close association with combinatorial chemistry studies in which variations in activities caused by the systematic modifications of structures can yield insight into the reaction activity mechanism.

A particularly interesting example of such a QSAR is comparative molecular field analysis (CoMFA) [2], in which reactants with common structural backbones varying by residue substitutions may be aligned with one another, and physical characteristics such as electrostatic potential and steric energies may be measured for each reactant on a common grid with thousands of points. This is doubly interesting because it explicitly uses descriptors that depend only on the 3D character of a molecule rather than any information on its 1D topology.

CoMFA QSAR regression systems are grossly underdetermined. However, it should be expected that the variations at the various field points should be correlated well enough (because of the relatively small number of residue substitutions as well as the discrete character of residue substitution in combinatoric studies) that a meaningful relationship between those residue substitutions and the activities may be determined. Then those grid points around the residue sites that are important to the determination of the activity will make large contributions to the regression.

One procedure that treats such a grossly underdetermined system is the PLS procedure [3], which was popularized in the CoMFA program of the TRIPOS SYBYL package.1 The procedure has, since this popularization, emerged as a standard of analysis in numerous research publications. The first applications of CoMFA have also emerged as benchmarks by which other 3D QSARs are measured [2, 4–12].

The application of QSARs to spectral data has a character similar to that of CoMFA. In this case, the spectra are sampled on a one-dimensional lattice of wavelengths. The amplitudes become structure variables for QSAR computations. The application of QSARs to spectroscopy, including near-infrared spectroscopy, shares the overdetermined character of CoMFA studies [13] with a good review by Faber and Kowalski [14].

QSARs have been applied to gene expression analysis in determining levels of gene expression as a function of descriptors of pharmacological substrates or treatment levels [15]. One particular application of regression of particular interest to gene expression studies in general involves the exploration of the relationship between transcriptional and translational control of gene expression [16]. One future application may be predicting survival of alleles, such as cancer survival rates [17], or perhaps of fermentation by yeast alleles [18] as a function of the levels of expression measured by gene array chips.

Yet another area of increasing interest is that of quantitative structure–property relationships (QSPRs). This is the prediction of physical characteristics such as boiling points, vapor pressure, critical temperature, critical micelle concentrations, and polymer–glass transitions [19]. A possible candidate for application could be the phenomenological exploration of protein folding times. Enzyme mechanism is often elucidated by the examination of mutation activities, just as in combinatoric chemistry studies. Such studies have also been performed on protein folding rates [20]. This form of study is very consistent with standard combinatoric QSAR studies, and represents an opportunity for exploration by QSAR techniques.

But while PLS is a computationally easy method of computing regression coefficients for systems with large numbers of independent variables, the computation of the coefficients is nonlinear in the dependent variable. This implies that error propagation may be performed as a linearized variation, reviewed by Faber and Kowalski [14]. This is frustrating to researchers because the quality of the predictions becomes more difficult to understand. Such an understanding should include the following:

  • The range of variation in the regression coefficients that would be consistent with the data.
  • The consistency or predictive power of the model with the data as measured by the ability of the structure variables to predict the activities.
  • The number of components in the independent variables that actually carry statistical predictive power.
  • The contribution of each descriptor to the activity.

One solution to this problem that is also incorporated into SYBYL is cross-validation [21, 22], in which multiple regressions are performed on datasets with various data points excluded from each regression. The variations between regressions for predicted coefficients yield a measure of the range of variation that is consistent with the data, and the quality of the regression is measured by the predicted vs. expected error squared and summed (PRESS) of the activities compared to the variance of the activities. (See Appendix A.) This compares the regression model to a prediction by constant value. PLS does not directly determine regression coefficients. Instead, it determines coefficients for each of a sequence of PLS components. Each of the subsequent PLS components is essentially the correlation between the dependent and independent variables, after the previous component has been projected out of the data. (See Appendix B.) This stops when all of the projections are zero, which happens when, or before, the number of points is reached, or the number of independent variables is reached (whichever is smaller). The number of PLS components that produces the best cross-validation is not usually the one in which all of the components are projected out. Instead, the best cross-validation occurs at some smaller number of components that represents some measure of the actual number of independent variables contributing to the prediction of the data.

PCR procedures [23, 24] emerge naturally from the quadratic structure of the least-squares problem. The expression for the sum of the squares of the error between predicted and expected values may be expressed as a quadratic form in the regression coefficients. The principal components are orthogonal combinations of the data that diagonalize the coefficient quadratic form. Error propagation in PCR is straightforward and well understood [23, 24]. Further, a least-squares chi2 statistic to provide a measure of the goodness of fit based on a probability model is commonly used [24–26]. The probability model in minimum chi2 estimation assumes that the random deviations of the data from the linear regression model are Gaussian.

There are two reasons why PCR has not been adapted for use in CoMFA and other QSAR packages and studies. Most important is that PCR requires the diagonalization of a large matrix. If the number of sampled grid points is in the thousands, the matrix to be diagonalized has a number of components in the millions.

The second reason is a little more subtle. Cross-validation techniques have shown that the best predictivity is not achieved with the inclusion of all of the components, but is usually achieved with some subset. PLS provides only one order of extracted components. However, if there are N independent PCR components, there are 2N possible subsets that can be constructed from them. This can be prohibitively large to compute. One of the major problems in principal component regression is the selection of an appropriate subset of components. The solution presented by this paper is finding a set of components that extremizes the chi2 probability, as is typical of some other solutions [23].

There is a difference between cross-validation as a measure of predictivity and chi2 as a measure of goodness of fit. Goodness of fit measures the consistency of a regression model with data. Cross-validation predictivity measures the extent to which a training set provides information that can predict other unknown data. Practically, this means that goodness of fit includes all points in the set for computing the errors, and cross-validation excludes each point from the prediction computation. In any large dataset, the effect of including or excluding any one point should be diluted. However, the difference can be greatly magnified in any small QSAR set or grossly underdetermined dataset.

Modeling studies, either by linear regression or by more flexible techniques, generally fall into one of two non-exclusive groups: those seeking to understand and measure the contribution of some particular independent or control variables to some putative dependent variable, and those seeking to simply predict the dependent variable as well as possible given some posited control variables, but not necessarily very concerned with the structural consistency of the model itself with the data. Obviously these considerations are not independent or orthogonal, but their interests are sufficiently distinct that the preferred measures of performance tend to be different. Those who are primarily interested in the consistency of a model with the data, or who are interested in determining what the model parameters are, how stable they are, and what their uncertainties are will be more interested in goodness-of-fit considerations. Those who are more interested in prediction will tend to be more interested in predictivity measures and techniques such as bootstrapping, cross-validation, jackknifing, and resampling [27]. This distinction becomes more evident when the researchers interested in determining the parameters are more than willing to confront the complications associated with untangling the impact of correlations between the independent variables on the regression coefficients, whereas those interested in prediction may consider such questions to be unnecessary complications that do not add to understanding.

This paper was motivated by the growing opportunities for practice of the application of linear regression, particularly in grossly overdetermined systems, to biological problems, which were reviewed earlier in this section. However, the focus is on those problems associated with regression that are peculiar to the grossly overdetermined systems that are emerging in those biological applications, with an emphasis on the measurement of regression coefficients and goodness of fit.

2. Regression model

The regression model is expressed simply as

yi = summation
j
xijaj + ei,
(1)

where yi is the ith measurement out of N of the dependent variable (activity for the ith molecule in the dataset), xij is the ith measurement of the jth out of D independent variables (jth QSAR structure descriptor for the ith molecule in the dataset), aj is the jth regression coefficient, and ei is the error in the prediction of the ith dependent variable by the regression. Many regression studies use Greek letters to represent estimated regression parameters, but this usage is not universal [24, 26]. The expected values of ei are described as

E(ei)=0, (2)

E(eiej) = Deltay 2
i
deltaij.
(3)

If each of the errors ei is Gaussian distributed, the statistic

script E2 =  
summation
i
ei2
fraction bar
Deltayi2
=  
summation
i
1
fraction bar
Deltayi2
open parenthesis yi  
summation
j
xijaj close parenthesis 2
(4)

is chi2 distributed with N degrees of freedom [23–26]. This minimization of script E2 is the foundation of both PLS and PCR. It may be expressed in terms of matrices as

script E2 = (Xa – y)TC(Xa – y), (5)

where C is the diagonal matrix with elements (C)ij = deltaij/Deltayi2. This may be expressed alternatively as

script E2 = (a – a0)TXTCX(a – a0) + yTCy – a0TXTCXa0, (6)

where

a0= lim
epsilonarrow0
(XTCX + epsilonI) – 1XTCy.
(7)

The limit limepsilonarrow0(XTCX + epsilonI)–1XTC1/2 is called a “generalized inverse” of C1/2X. The limit
limepsilonarrow0(XTCX + epsilonI)–1 is undefined unless the matrix first operates on another matrix or vector which has no projections along eigenvectors of XTCX that correspond to eigenvalues equal to zero. However, if u is an eigenvector of XTCX with a zero eigenvalue (Xu)TC(Xu), it follows that any projection Xu of X along u will be zero, since XTCXu = 0 implies that uTXTCXu = (Xu)TC(Xu) = 0. Since C is diagonal, this implies that each (Xu)i = 0. Further, this implies that uTa0 = 0. Note that this solution is not unique. Any a'0 = a0 + deltaa, where XTCXdeltaa = 0, produces an equivalent script E2.

Since script E2 is a chi2 statistic, it follows that a0 and XTCX are essential statistics. Any changes in a may be accounted for by the contribution to the error script E2 by the coefficients

script E 2
coef
= (a – a0)TXTCX(a – a0),
(8)

with the remainder accounted for by the residual

script E 2
res
= yTCy – a0TXTCXa0,
(9)

so that

script E2 = script E 2
coef
+ script E 2
res
.
(10)

The total number of degrees of freedom in script E2 is N. The number of degrees of freedom in script E2coef is equal to the number D0 of eigenvectors with corresponding nonzero eigenvalues of XTCX. This leaves N – D0 degrees of freedom for script E2res. This partition is very reminiscent of Bayesian treatments of linear regression [28], but the presentation here follows “frequentist” notions of sampling theory.

The expectation value of a0 is

E(a0) = E open curly brace  
lim
epsilonarrow0
(XTCX + epsilonI) – 1XTCy close curly brace = a,
(11)

where a is the fixed parameter characterizing sample space, to within the previously described ambiguity. Both PLS and PCR solve the same minimization of script E2 if all components are employed, and so both obtain the same value of a0. The covariance is predicted by

cov(a0, a0) = E[(a – a0)(aa0)T] =  
lim
epsilonarrow0
(XTCX + epsilonI) – 1,
(12)

determined by the inverse of the quadratic coefficients in script E2coef. As pointed out before, this limit does not exist if there are eigenvalues of XTCX equal to zero. This means that any contribution to a of any magnitude in a direction corresponding to a null eigenvector of XTCX does not contribute anything to script E2, and implies that the coefficients essentially have an infinite uncertainty and are completely undetermined in any underdetermined system. This is just another reflection of the ambiguity in underdetermined systems.

A meaningful alternative measure of covariance is the amount by which the estimate of a will vary given the variations in y. This is essentially equivalent to the effect of allowing y to vary according to the variation in e, implying that

covsubspace(a0, a0) = E{a0[e]a0[e]T}
 
lim
epsilonarrow0
E{(XTCX + epsilonI) – 1XTCeeTCX(XTCX + epsilonI) – 1}
 
lim
epsilonarrow0
(XTCX + epsilonI) – 1XTCE(eeT)CX(XTCX + epsilonI) – 1
 
lim
epsilonarrow0
(XTCX + epsilonI) – 1XTCC – 1CX(XTCX + epsilonI) – 1,

or

covsubspace(a, a) =  
lim
epsilonarrow0
(XTCX + epsilonI) – 1XTCX(XTCX + epsilonI) –1 .
(13)

The limit does exist in this case because (XTCX + epsilonI)–1XTCX acts like a projection operator that picks out only those eigenvectors with nonzero eigenvalues. This expression compares favorably with the variation in the coefficients observed between the various regressions produced by cross-validation. Such a result constitutes an explicit measure of the stability of the coefficients to variations in the dependent variables.

It is important to realize that while some consistency may be expected within a dataset, and it is possible to ask whether a model is consistent with a dataset in a statistical sense, underdetermined systems do not yield definitive measurements of all of the coefficients. Comparison with other datasets that could ultimately produce a complete model if the data were combined would not produce coefficients consistent with one another.

Not only is it possible for a regression to be underdetermined in the sense of having eigenvalues equal to zero, but some of the eigenvalues of XTCX may be very small. Such a system is called “poorly conditioned.” This corresponds to some var(ai) being very large. Such terms can add spurious and large contributions to a0 without significantly affecting script E2, suggesting that it is desirable to exclude contributions from various subsets of components that may not correspond to zero-valued eigenvalues. The systematic consideration of the character of individual principal components in the analysis of the script E2 quadratic form is perhaps the best definition for principal component regression. This includes but is not limited to issues of conditioning of the regression equations. While PCR provides a direct way to examine questions such as the conditioning of a regression, PLS provides no direct way to consider the issue of whether a regression is poorly conditioned or not.

Consider a projection operator P that is a projection onto a subset K of eigenvectors uk of XTCX. As such, P satisfies

P =  
  summation
k member K
ukukT,
(14)

P2=P, (15)

[P, XTCX]=0. (16)

Now, it is desirable to partition script E2 in a different way along projections of Pa:

script E2[P] = (Pa – a0)TXTCX(Pa – a0) + yTCy – a T
0
XTCXa0.

Identifying the projection operator Q = I – P, and noting that

(Pa – a0)TXTCX(Pa – a0) = (Pa – Pa0)TXTCX(Pa – Pa0) + (Qa0)TXTCX(Qa0),

and that

(a0)TXTCX(a0) = (Pa0)TXTCX(Pa0) + (Qa0)TXTCX(Qa0),

it follows that

script E2[P] = script E 2
coef
[P] + script E 2
res
[P],
(17)

where

script E 2
coef
[P] = (Pa – Pa0)TPXTCXP(Pa – Pa0)
(18)

and

script E 2
res
[P] = yTCy – a0TPXTCXPa0.
(19)

This is a very interesting partition of the degrees of freedom. The operator P removes degrees of freedom from script E2coef[P] and essentially transfers them to script E2res[P]. Since the total number of degrees of freedom in script E2[P] remains N, and the number of degrees of freedom in script E2coef[P] is now DP, the total number of degrees of freedom in script E2res[P] is now N – DP. Further, while the number of degrees of freedom in script E2res[P] increases when poorly conditioned eigenvectors are excluded, so does the total value of script E2res[P]. The relationship between the goodness-of-fit error and the exclusion of particular components is well understood [23] and has been considered in comparisons between PLS and PCR [13, 14].

Partitioning the error according to contributions by projections of components has an immediate application, allowing comparison of the goodness of fit for different subsets of components. In particular, for a subset of components projected by P, the probability that a chi2 larger than this might be observed is P(chi2N – DP > script E2res[P]). Those with larger probabilities better represent the fit. Note that if N = DP, which happens when all of the non-null components are used in an underdetermined system, P(chi2N – DP > script E2res[P]) is undefined. There is essentially no statistical information about the quality of the fit if all of the principal components are included.

Further, the contributions of each individual component may also be determined. The contribution to script E2res[P] may be determined for any particular component k. For any component k with eigenvector uk, the projection operator is Pk = ukukT. This implies that the effect of any particular eigenvector is to subtract a variation

script E 2
k
= a T
0
uku T
k
XTCXuk ukTa0 = yTCX(XTCX + epsilonI)–1uku T
k
XTCXukukT(XTCX + epsilonI)–1XTCy

or

script E 2
k
 =  yTC(Xuk)(Xuk)TCy
fraction bar
(Xuk)TC(Xuk)
,
(20)

where Ak = (Xuk)TC(Xuk) is the eigenvalue of XTCX corresponding to eigenvector uk. The contribution of the kth component to cov(a, a) varies as 1/Ak. This is a reflection of how well conditioned the contribution from this component is. Small Ak components contain little discriminating information compared to the uncertainty they contribute to the regression coefficients. Exclusion of the smallest Ak contributions therefore improves the stability of the coefficients and reduces the size of the uncertainty in those parameters.

However, the largest script E2k contribute the most toward improving the goodness of fit, since they reduce script E2res[P] the most. Therefore, the value of script E2k represents the predictive power of the kth component. It is possible therefore to rank the components by predictive power. Consequently, it is possible to construct a list of subsets with the largest predictive power, then the next list containing the largest together with the second largest, and then the third list containing the top three predictive components, etc. This reduces the computation from all 2N possible subsets of components to a simple list of subsets N long. Once this is done, it is possible to compute
P(chi2N–DP > script E2res[P]) for each of the subsets. This probability generally passes through some extremum, which represents the optimal subset of components. Since the questions of the information in a component as measured by Ak and the contribution the component makes to the goodness of fit are distinct, exclusion of low-information components may be achieved by applying a cutoff to Ak. A selection of the most important contributors to the goodness of fit may then be applied.

This approach may be applied to the situation in which the size of the uncertainty is unknown and it is desired to estimate some best uncertainty from the regression of the data. This may be achieved by choosing C = I/DeltaY2, to yield

E(script E 2
res
[P]) = N – DP = 1
fraction bar
DeltaY2
(yTy – a T
0
PXTXPa0),
(21)

and solving for DeltaY2. The best subset is the one that produces the smallest DeltaY2. This component-selection criterion is essentially identical to one proposed by Lott [29], who also recognized the possibility of reducing the optimal space of subsets by ranking the components. However, the connection between the selection of an optimal subset and a minimum DeltaY2 was not established, and connection with chi2 was not explored. Generally, for overdetermined systems, DeltaY goes through a minimum as the number of components is decreased. The smallest set is the best. However, in underdetermined systems there tends to be no minimum in DeltaY. For a fixed DeltaY, there is usually some particular subset of components where P(chi2N–DP > script E2res[P]) is minimized. Once some DeltaY is selected and the component subset is extracted, the values of a0 and cov(a, a) that are consistent with the quality of the regression and the variation in the data may be computed.

Principal component regression follows the plan of principal component analysis (PCA): It is assumed that the solution space is best represented by some subset of components. But the problem of PCA is to find a subset of components that represents the data to within some limit of accuracy. In the case of PCA, the most important components are those with the largest variance. (See Appendix C.) The problem with selecting a subset that spans the space of variation in PCR is that the dependent variable may depend on some of the components with smaller variation: If the short axis is discarded, there is no dimension to describe the layering of its structure. This problem is well known, and there have been a number of solutions posed for selecting some optimal subset of components [23]. However, many commercial packages still rank components according to their variation Ak = (Xuk)TC(Xuk), which measures only the information in the component, and not the contribution of the component script Ek to the goodness of fit. This has led to some unfounded criticisms of PCR in various comparisons with PLS. The use of
P(chi2N–DP > script E2res[P]) as the measure of the quality of fit for component subsets presented here appears to be a new contribution.

One of the prohibitive costs of PCR computation is the diagonalization of the large matrix. However, since small Aks yield poorly conditioned regressions, and zero-valued Aks correspond to undetermined contributions not spanned by the dataset, it is appropriate to exclude them, as in PCA. This implies that a computational algorithm such as NIPALS [30] may be applied, which is much more efficient if only the first few components are desired.

3. Conclusions

Partial least-squares analysis and cross-validation have emerged as standards in analyzing 3D QSARs because of their simplicity of computation. Together, they provide methods for assessing

  • The range of variation in the regression coefficients that would be consistent with the data.
  • The consistency or predictive power of the model with the data as measured by the ability of the structure variables to predict the activities.
  • The number of components in the independent variables that actually carry statistical predictive power.
  • The contribution of each descriptor to the activity.

Since PLS and PCR address these issues differently, it is instructive to compare them. The most significant difference is that the contributions by the individual PCR components are immediately available, and the implications of each one with respect to each of the items in the above list are computable. PLS components are much more intermixed; none of the components have meaning outside of the context of the rest.

First, PLS allows for the examination of the variation between descriptor coefficients by comparing the cross-validation regressions. Usually variances and covariances in the coefficients are not provided, though estimates could be computed from the variation in cross-validation regressions. In PCR, the contribution of each component to the variances and covariances of the coefficients is immediately available. In either case, the availability of this information permits the assignment of error bars to estimates of activity.

Second, there is a difference between goodness-of-fit measurements as defined by
P(chi2N–DP > script E2res[P]) and the cross-validation predictivity. On one hand, the predictivity measures how well the set can predict data that are not included in the set. This implies that it measures extrapolative power, while goodness of fit measures the quality of consistency of the data, or the interpolative power. In this sense, at least in nondilute or data sparse sets, predictivity may appear to be more demanding than goodness of fit. However, the reference variation that is used in the predictivity measurement is the variance in the independent variables (activities), while the reference variation in the goodness of fit is the distribution of deviations of the model from the actual data. The errors in the dependent variable should be much more restrictive than the variation in the independent variable. If the error bars are larger than the total range of measured variation, the signal is indistinguishable from noise. Predictivity requires only that the regression perform better than the total variation in the dependent variable. Goodness-of-fit consistency requires that the regression perform better than the uncertainty in the dependent variable. Usually a regression that passes a predictivity test does not pass a consistency test to within experimental error. In most cases, it is better to estimate the error as an unknown, attributing some of its magnitude to variables that have not been accounted for in the regression model. In that case, the estimated variance is usually much smaller than the variance of the independent variables.

Third, the use of cross-validation implies that there is some variation between the regressions. It can be difficult to identify corresponding PCR components between separate cross-validation regressions, especially in grossly underdetermined systems. However, it is generally possible to identify which of the components of leading importance maintained their importance from regression to regression, as well as how much those components vary among regressions. In PLS, it is very difficult to understand the relationships among the components, much less how the decomposition varies among regressions and what that variation in the decomposition might mean. This exposes at least one problem with cross-validation; the regression to some extent compares unrelated components in choosing which models with which components most closely represent the data. The reason is that the least-squares solutions obtained by minimizing script E2 are not optimizations of predictivity measured by PRESS, since predictivity is based on points not included in the regression and thus not included in script E2. At least one implication of this fact is that it is difficult to assess the impact of excluded data on the regression coefficients analytically. The decomposition of PLS selects those parts of the independent variation that correlate most strongly with the dependent variations at each step. Usually the number of PLS components appears to be less than the number of PCR components. However, those components are distinct in their character and should not be compared directly with one another. Usually regressions requiring more PLS components than other regressions also require more PCR components than the others.

Fourth, the relative contribution of each of the descriptors to the variance of the dependent variable is presented in SYBYL as |ai| square root( var(xi)/var(y)). However, this does not take into account the effects of correlation among the descriptors within the sampled set. Further, this computation is also available within the context of PCR, and PCR allows for the identification of the importance of individual components to each regression by examining the projections Xuk for each principal component k, taken together with the predictive power of the component, as well as the relative contributions (amplitudes) of the descriptors to the components in uk.

Perhaps most important, it is possible to recognize when extrapolation data points contain information outside the subspace spanned by the data. For example, if uk corresponds to a zero eigenvalue, xTuk is a measure of the projection of that extrapolation candidate into uncharacterized space, where the contribution to the activity will be undefined. This information is not as directly available in PLS, even though it is possible to examine the residual in the independent variables after decomposition by the training set has been performed.

More, if uk corresponds to a poorly conditioned component, xTuk measures the projection into badly characterized space, whose contribution to the activity is poorly defined. This information is not available in PLS, and care must be exercised to recognize when spurious results follow from data that are poorly conditioned. A good example of this would be the inclusion of several descriptors which are constrained to sum to zero. PLS may recognize the round-off error as a part of the regression, yielding large coefficients for those descriptors.

One argument against PCR is that the components are an artifact of the space that happens to be sampled by the dataset. For example, the components themselves depend on how the independent variables were scaled, and the quality and predictive power of the components can be changed merely by changing the units of measurement. This shows up particularly in that measurement unit changes can change the activity predictions in using some number K of components. It is particularly true in the case of underdetermined systems, in which strong correlations between the descriptors may be induced simply by the small sample size compared to the number of descriptors. Such pathology happens to be true of PLS as well, but PLS hides its appearance in its decompositions more effectively. The technique most often adapted by both PCR and PLS is to rescale the descriptors by the root of the variance of that descriptor within the dataset; this tends to give a more balanced variation between descriptors.

Ultimately PCR and PLS are founded on the minimization of a sum-of-squares error expression. This paper has explored some of the analytic consequences of least-squares solutions as they apply to PLS and PCR, and has considered how they address those issues of error analysis that have emerged as being very important to QSAR studies.

Appendix A: Cross-validation

Since the number of degrees of freedom associated with individual PLS components is not known, the construction of a chi2 error associated with a truncated set of components is not meaningful. One way around that is to ask how well the individual, and it is assumed independent, points are predicted by a fit to all of the other points. This leads to the idea of cross-checking the fits of all of the data points via a method called “cross-validation.” Further, it is desirable to try to determine which components, or how many, are necessary to predict the data. By applying cross-validation to predictions using varying numbers of components, it is possible to determine the number of components that produces the best predictive capability. This approach partly motivated the development, presented in this paper, of a PCR component subset selection based on chi2 contributions that measured the probability that a linear model could describe a set of data consistent with errors of measurement.

Cross-validation is a technique that may be applied either to PLS or to PCR. The method cycles through the set of points, excluding one point from the computation of the fit parameters, and then constructing an error for the fit vs. actual excluded point. A sum-of-squares error is constructed for this and is used in an F-test to compare against the variance about the mean. If this were applied to PCR, it is not clear how many degrees of freedom should be assigned to the number of independent components that were kept for the fit, since each component varies from point to point for each of the predicted points that was excluded from the computation for that fit. For PLS, where each retained component does not contribute a chi2 degree of freedom in the first place, it is even more unclear how many degrees of freedom to assign to each retained component. However, common practice seems to be to assign one degree of freedom per component.

The statistics that apply follow. The predicted sum of squares PRESS is defined as

PRESS =  
summation
  i
[yiny hati]2,

where ny hati is the predicted value of y for the ith point using n PLS components. A q2 that is much like a correlation coefficient compares the ratio of PRESS to the variance

q2 = 1 – PRESS/N
fraction bar
sigmay2

where N is the number of points. The F statistic is constructed from

FN,N = 1 – q2 = PRESS/N
fraction bar
sigmay2
.

The numerator is essentially the variance from prediction; the denominator is the no-prediction (all values predicted by the mean of the independent variables) variance. If the denominator is a variance estimated about a sample mean, the numerator has a number of degrees of freedom associated with the number of data points N, but the denominator has the degrees of freedom associated with the same number of data points, but estimated about the mean N – 1. The actual F statistic should then be

FN,N–1 = PRESS/N
fraction bar
s2
,

where the s2 has an N – 1 in it.

The literature frequently cites q2 without interpretation in terms of F scores or probability estimates, thus avoiding the question of whether the number of degrees of freedom are defined. Assuming they are defined, the critical value for q2 given the degrees of freedom has the form

q2 = 1 – F.

At the 95% level, if N = 2, the value of F is F = 0.054017 for a q2 = 0.9459. For N = 6, F = 0.227927 for a q2 = 0.7731. For N = 10, F = 0.331084 for a q2 = 0.6689. For N = 100, F = 0.718046 for a q2 = 0.2819. Depending on the number of data points, the range over which a critical q2 can vary is significant. Frequently a q2 is cited in the literature with no sense of whether it reflects a statistically significant level.

Appendix B: Partial least squares

The approach taken here is to iteratively extract components each one of which satisfies the form of the least-squares equation. Each of the components is linearly independent of the previous ones, until no more components can be constructed [3].

In this case, just as in the PCR case, the coefficients a in a vector of length D in an expression of the form

xTa = yp,

for a particular vector x composed of D descriptors and a scalar yp, are determined by minimizing a chi2 error,

script E2 =  N
summation
 i=1
wi(xiTa – yi)2 = (Xa – y)TC(Xa – y),

where wi = 1/Deltayi2, C is diagonal with (C)ii = wi, y is now a vector N long, and X is an N × D matrix. The minimization of script E2 with respect to a yields

c identity XTCXa = XTCy

for a vector c of D components. This equation is to be solved for a. One problem is that the matrix multiplying a may be singular. If the goal is to extract the most important factor leading to y, something related to the correlation of the y with X should be examined. This might most simply be done by taking the projection along c, which is just that correlation. Then

||c|| = c hatTc = c hatTXTCy = c hatTXTCXa.

Defining u = Xc hat, ui = xiTc hat, and identifying

b identity uTCX
fraction bar
uTCu
,

it follows that

bc hat = 1

and

ba = uTCy
fraction bar
uTCu
.

If the preceding describes how the structure components line up with the response or activity data, that information may be used to extract the projection of this dominant leading component. A transformation is sought that subtracts displacements D from X such that X' = X – D are perpendicular to c hat. At the same time, there should be a deltay which, when subtracted from y, leads to the corresponding y' = y – deltay. The connection between X and y is through a in y = Xa, which is to be preserved in the projected components y' = X'a with the same a. This implies that the transformation will involve a deltay = Da. The only vector we have constructed that has a well-defined scalar product relationship with a is b. Therefore, we want to construct D such that Dc hat acts like a projection with bc hat. This may be done by seeking f such that D = fb and

X' = X – fb,

where

X'c hat = 0.

This is satisfied when

X'c hat = Xc hat – fbc hat = u – f = 0,

so

f = u

and

X' = X – ub.

Forming the inner product of this with a yields

X'a = Xa – uba.

Recognizing that y' = X'a, y = Xa, and ba = uTCy/uTCu, this simplifies to

y' = y – u uTCy
fraction bar
uTCu
.

Once the projected X and y have been obtained, these may be used to compute new c and b, and new projections obtained until c = 0. Since each new c is constructed from linear combinations perpendicular to all previous c, each c is perpendicular to all previous c. If fewer data points than components are present, the termination condition is met in a number of iterations smaller than the number of components. The order of subtraction proceeds from largest to smallest correlation with the y.

There is no explicit computation of a as so far defined. Instead, predictions are obtained by decomposing an unknown x by the set of c and b that were determined by the iterative application of the above projections on the training set. Each iteration folds the data back on itself nonlinearly. In order to predict a particular yp for some x, a reverse application of the above projections must be applied. For each iteration, x must be decomposed using

x' = x – vbT,

where v = xTc hat. The remaining x, after the complete decomposition, is a measure of the information in x that was not accounted for by the original dataset that was used to construct the least-squares decomposition. Starting with yp = 0, the new y must be composed,

yp = y'p + v uTCy
fraction bar
uTCu
,

for the b and c corresponding to the step of decomposition that was determined from the previous decomposition of the data.

Now, if

y = m1x1 + m2x2 + m3x3 + · · · ,

it follows that x = (1, 0, 0, · · · )will yield y = m1, x = (0, 1, 0, · · · ) will yield y = m2, · · · , etc.

The relationship between the error script E2 between steps extracting components may be extracted as follows. First,

script E2 = (Xa – y)TC(Xa – y)
= open parenthesis X'a – uba – y' + u uTCy
fraction bar
uTCu
close parenthesis T C open parenthesis X'a – uba – y' + u uTCy
fraction bar
uTCu
close parenthesis
= open parenthesis X'a – u uTCy
fraction bar
uTCu
– y' + u uTCy
fraction bar
uTCu
close parenthesis T C open parenthesis X'a – u uTCy
fraction bar
uTCu
– y' + u uTCy
fraction bar
uTCu
close parenthesis
= (X'a – y' )TC(X'a – y' ).

Therefore, the extraction of each component leaves the error unchanged. The error for a PLS regression is determined entirely by the residual for the last extracted component in the subset of components. This is entirely different in character from the notion in PCR that each component contributes a predictive capability that can be measured by its “predictive power” (see Section 2).

This method has some advantages. First, it produces leading contributing structure components that predict the strongest correlations with the activities in order. There is no need to compute any more than is desired.

However, there are a number of disadvantages. First, there is no really overt computation for a. Instead, PLS provides a decomposition technique. The dependence of the error script E2 on a is never explicitly determined. Further, even the contributions of each individual component of script E2 depend on the entire subset that was applied. The method is nonlinear in the activities, which makes a propagation of errors very difficult. Since the components are not orthogonal in quadratic contributions to the chi2 statistic script E2, it is impossible to determine the number of degrees of freedom that have been involved if some components are dropped, and therefore there is no internal goodness-of-fit statistic. Instead, a “cross-validation” technique [21, 22] is usually employed to try to extract this information. Further, the components must be picked off in order. It is impossible to measure the individual contributions of each component outside the decomposition sequence, so an independent articulation of the contribution of each component, as with eigenvectors, is impossible.

In conclusion, there is no direct estimation of a. The nonlinearity in y makes a propagation of errors difficult, and a direct way to compute cov(ai, aj) is unavailable. Since the components are not orthogonal in quadratic contributions to the chi2 statistic script E2, it is impossible to determine the number of degrees of freedom contributed by the components; therefore, there is no internal estimate of goodness-of-fit statistic. To paraphrase, while one can determine the simplest elements of the fit, it is not readily clear how much is known, how well it is known, or even whether something new is not known from previous experience.

Appendix C: Principal component analysis

The notion behind principal component analysis [23] is that data may be approximated most effectively by the leading principal components of the covariance matrix. This may be understood as follows. Consider a set of N vectors x vectori, which is represented in terms of variations about the mean

x vectori = xi vectori + X vector,

so that (1/N) Sigmai x vectori = X vector; and whose variation about said mean is expressed in terms of the eigenvectors of the matrix C = (1/N) Sigmai xi vectorixi vectoriT. These may be written

Cupsilon hatk = sigma 2
k
upsilon hatk.

Then, since the upsilon hatk span the space, the xi vectori may be expressed in terms of the upsilon hatk as

xi vectori =  
summation
k
cikupsilon hatk.

Since C is symmetric with real coefficients, the upsilon hatk are orthogonal. It follows that

cik = upsilon hat T
k
xi vectori

and that (1/N) Sigmai c2ik = (1/N) Sigmai upsilon hatTkxi vectorixi vectorTiupsilon hatk = upsilon hatTkCupsilon hatk = sigma2k. If sigmak = 0, it follows that cik = 0. Thus, the cik may be expressed as proportional to sigmak, or cik = uiksigmak, so that

uik = 1
fraction bar
sigmak
upsilon hat T
k
xi vectori (where sigmak not_equal 0)

and

xi vectori =  
summation
k,sigmaknot_equal0
sigmakuikupsilon hatk.

Then (1/N) Sigmai u2ik = 1. Next, the question is how much each of the components contributes to the representation of the data. In particular, if some subset S of the components is used, and its complement S' is excluded, how good is the approximation? We define an error


script E2(S)
= 1
fraction bar
N
 
summation
i
absolute value  
summation
k memberS
sigmakuikupsilon hatkxi vectori absolute value 2
= 1
fraction bar
N
 
summation
i
absolute value  
summation
k memberS'
sigmakuikupsilon hatk absolute value 2
= 1
fraction bar
N
 
summation
i
 
summation
k,k' memberS'
sigmak'sigmakuik'uikupsilon hat T
k'
upsilon hatk
= 1
fraction bar
N
 
summation
i
 
summation
k memberS'
sigmak2uik2
=  
  summation
k memberS'
sigma k2.

The error is then equal to the sum of the eigenvalues of the covariance matrix that were excluded from the approximation. If these are the smallest eigenvalues, the approximation shows a minimum error.

References

Footnote

1 Available from TRIPOS Associates Inc., 1699 S. Hanley Rd., St. Louis, MO 63144.

Received July 24, 2000; accepted for publication October 20, 2001