1. Introduction
Protein folding has fascinated theorists and experimenters for decades. This fascination has been driven not only by the phenomenon, being the simplest act of biological self-organization, but also by practical considerations relating to the desire to compute protein structure from sequence data. There are signs that the appropriate framework for understanding biomolecular self-organization is emerging. In considering folding theory versus structure-prediction practice, cynics, however, are likely to paraphrase the old saw Thermodynamics owes more to the steam engine than the steam engine owes to thermodynamics.
In this paper, we initially review how folding theory based on energy landscapes has already contributed to progress on structure-prediction algorithms. We then illustrate how the computational tools used in energy landscape approaches to the study of folding kinetics can help distinguish and evaluate different energy functions used in structure prediction and how this analysis helps to quantify where improvements are most needed.
One major contribution of the last decade's study of folding to the practice of prediction is to give predictors several additional reasons for optimism, a necessary emotional stance for people to work in this field! A decade ago most people quoted folding times as being seconds to minutes. Indeed, several theories from the 1970s used these numbers as a procrustean bed for setting parameters! This would make prediction by even slavishly accurate molecular simulation, guaranteed to work by Anfinsen's experiment [1, 2], out of the question economically. Deductions from theory and heroic laboratory experiments have shown that a good deal of the self-organization in folding occurs in microseconds. Thus, if sufficiently accurate all-atom potentials are available, the IBM Blue Gene Project1 should have a good chance to succeed in that the plan is to simulate accurately the laboratory situation.
The rapid folding without discrete traps observed experimentally suggests that the free-energy landscape (averaged over the solvent) of a protein is funnel-shaped [3, 4]. According to landscape theory [5, 6], structurally distinct traps emerge through inappropriate contacts arising from the inconsistency or frustration of different energy interaction terms. Evidence for the funnel point of view comes not just from detailed kinetics experiments [79], but also from the widespread anecdotal evidence that protein structures are robust to most single-site mutations. On a perfectly funneled landscape, barriers to folding are largely entropic and could be overcome by carrying out simulations at lower temperatures. Unfortunately, one cannot do this in complete all-atom models because there is frustration between solvent hydrogen bonds and van der Waals contacts in the protein. This leads to cold denaturationthe lowest energy (as opposed to free energy) state of protein plus solvent is actually an unfolded one. Thus, a Cool Blue Gene simulation with all atoms should not work any faster than a hot one. An important issue is whether solvent-averaged potentials as crafted by theorists, still with all atoms of the proteins represented, can then achieve funnel-like surfaces. The situation would be even better if reduced descriptions with less atomic detail could be used. Here there is also good news when theory and experiment are combined.
There are many indications that simplified energy functions with funnel-like landscapes can describe protein folding kinetics [1015]. The catch is that the beautiful kinetics results require making the funnel approximation as an extra assumption or filter in order to bring the numerical predictions into conformity with experiment. To predict protein structures from sequence data, one needs simplified energy functions that do not impose this constraint and therefore allow any possible contact. For prediction we cannot a priori limit the model by assuming native contacts to be overwhelmingly important. The widespread failure of structure-prediction schemes until recently, in fact, was prima facie evidence for these schemes being based on rough, i.e., non-funneled, energy landscapeslandscapes in which topologically distinct structures can be found to lie within a few kBT of each other near the folding temperature. Such states would represent traps in laboratory folding kinetics. They also lead to a lack of confidence about the result of a prediction, since different simulation runs should give each as a separate result. This situation is generically to be expected for random sequences using any given energy function. The trick is to find an energy function consistent with the one used in evolution. A structure-prediction scheme with a rugged non-funneled landscape for naturally evolved protein sequences is effectively hardly better than some random energy function. If proteins evolved to have a funnel-like landscape, there is also reason to believe that errors in potentials can be tolerated. Bryngelson showed that if the landscape is completely random, the potential must be known to an accuracy that scales as 1/ N, making it nearly impossible to predict structures of long proteins [16]. Using similar methods, Pande et al. showed that for a funneled, minimally frustrated landscape, the percent accuracy needed did not scale with length. Using estimates of natural protein landscape ruggedness derived from experiment, they found that an accuracy of only 2030% would be needed [17].
While a decade ago structure-prediction schemes based on simplified protein representations were nearly uniformly unsuccessful, some success can now be found. Some of the success can be viewed as arising from the development of schemes for emphasizing the funnel-like aspects of landscapes. Some of the success was explicitly motivated by the energy landscape paradigm, but sometimes the connection is only implicit. Homology modeling relies on using evolutionary information to reduce the importance of non-native-like associations [18], again assuming that evolution leads to minimal frustration. Energetic approaches for sequence-structure matching as a first step of modeling (threading) are improved by varying parameters to distinguish decoys from native structures [1924], again minimizing conflicts or frustration. This improvement can be carried out systematically using landscape-based algorithms. Averaging over multiple sequences smoothes out asperities on the landscape, again leading to a more globally funnel-like landscape [2528]. Associative memory Hamiltonians for structure prediction using molecular dynamics as well as more purely physically based Hamiltonians used in lattice calculations have been improved by making explicit use of the quantitative form of the minimal frustration principle to optimize their parameters [29].
That some success has been achieved, while heartening to those of us who are optimists, should not blind us to the possibilities of further improvements. This requires the development of quantitative evaluation methods for assessing prediction algorithms. To do this efficiently, we must go further than merely judging final results of our own calculations. Setting up social schemes for avoiding inevitably subjective comparisons is helpful but not sufficient by itself. Both of these latter issues are well addressed by the CASP experiments [30, 31]. One approach to determining how to improve structure prediction has been to exploit the same tools used for studying folding kinetics to describe the landscapes provided by potential energy functions used in structure prediction. While we have so far done this for understanding our own algorithms, in-house as it were, we believe the style of analysis can be used by others to help improve their own algorithms. We also hope to show in this paper how these schemes can be used to give an idea of the probability of success in multiple attempts to predict a structure using a stochastic algorithm.
The organization of the paper is as follows. In Section 2 we review the specific energy functions we use as well as the optimization schemes for relevant parameters. We then describe, in Section 3, sampling methods used to examine landscape topography. In Section 4 we discuss the resulting landscape for an ideal prediction scheme based on knowing the exact contact map of a protein, and for ab initio predictions, using associative memory energy functions which assume no homology information.
2. Model
While the method of analysis presented in this paper is generally applicable to any energy function, it is illustrated for an associative memory Hamiltonian in particular. In this section we describe the associative memory energy function. First we discuss the form of the energy function, and then we consider how the parameters in the energy function are optimized using energy landscape theory. Specific ab initio structure predictions from the energy function outlined here are discussed by Hardin et al. [32].
Associative memory energy function
The associative memory energy function discussed here has been designed with the objective of making ab initio predictions of protein structure. It was originally developed in the context of structure prediction by Friedrichs and Wolynes [33, 34], and is a reduced description, with only C , Cß, and O atoms explicitly represented. Search for low-energy states is carried out by molecular dynamics simulations. The full energy function is formally
where Eback describes the protein backbone, and Eamc comprises associative memory and contact terms, as discussed further below. The backbone part of the energy function is very similar to that used in previous studies [34, 35], and we relegate discussion of it to Appendix A.
Before considering the form of Eamc, we wish to make a comment regarding units. The unit of energy is denoted as , and is defined in terms of the native state energy excluding backbone contributions, ENamc, via
=
|
|ENamc|

4N
|
,
|
|
(2)
|
where N is the number of residues of the protein being considered. Temperatures are quoted in terms of the reduced temperature = kBT/ . Distances, r, are in units of angstroms, and is used to denote the dimensionless r/(1 Å).
The interactions described by Eamc depend on the sequence separation |i j| between the residues i and j involved. Specifically, they are divided into three proximity classes (|i j|): x = short (|i j| < 5), x = medium (5 |i j| 12), and x = long (|i j| > 12). Thus,
|
Eamc = Eshort + Emed + Elong.
|
(3)
|
Both the short- and medium-range interactions are treated by an associative memory energy function,
|
EAM
|
= Eshort + Emed
|
| |
|
=
|
|
Nmem
|
|
|
[Pi, Pj, Pi'µ, Pj'µ, x(|i j|)] × exp
|
|
|
(rij rµi'j')2
|
|
|
.
|
|
|
|
|
|
a
|
2 ij2
|
|
µ=1
|
j12 i j3
|
|
(4)
|
The sum over i and j runs over all unique pairs of atoms (C C , C Cß, CßC , CßCß) with sequence separation between 3 and 12, and rij is the distance between atoms i and j. The index µ runs over all Nmem memory proteins to which the protein has previously been aligned using a sequence-structure threading algorithm [36] (i.e., each ij pair in the protein has an i'j' pair associated with it in every memory protein; if, due to gaps in the alignment, there is no i'j' pair associated with ij for a particular memory, this memory protein simply makes no contribution to the interaction between residues i and j). The interaction between C and Cß atoms is thus a sum of Gaussian wells centered at the separations ri'j'µ of the corresponding memory atoms. The widths of the Gaussians are given by ij = |i j|0.15 Å. The weights given to each well are controlled by [Pi, Pj, Pi'µ, Pj'µ, x(|i j|)], which depends on the identities Pi' and Pj' of the residues to which i and j are aligned, as well as the identities Pi and Pj of i and j themselves. A reduced four-letter (as opposed to 20-letter) code is used for the identities: Pi = hydrophilic (i ala, gly, pro, ser, thr); Pi = hydrophobic (i cys, ile, leu, met, phe, trp, tyr, val); Pi = acidic (i asn, asp, gln, glu); and Pi = basic (i arg, his, lys). Since also depends on proximity class, there are thus 44 × 2 = 512 different parameters in the associative memory term. These are optimized using energy landscape ideas as discussed below. We finally note that a is a dimensionless constant chosen so that Equation (2) is satisfied.
The energy function in the long-range proximity class is given by a three-well contact potential,
|
Elong =
|
|
|
3
|
(Pi, Pj, k)ck(N)U[rmin(k), rmax(k), rij],
|
|
|
|
|
|
a
|
|
|
i < j 12
|
k=1
|
 |
 |
 |
 |
|
(5)
|
where i and j run only over all pairs of Cß atoms separated by more than 12 residues. The sum over k is over the three wells which are approximately square wells between rmin(k) and rmax(k). Specifically,
|
U[rmin(k),
rmax(k),
rij] =
|
1

4
|
{[1 + tanh(7[rijrmin(k)]/Å)]
×[1 + tanh(7[rmax(k)rij]/Å)]}.
|
|
(6)
|
The parameters [rmin(k), rmax(k)] are (4.5 Å, 8.0 Å), (8.0 Å, 10.0 Å), and (10.0 Å, 15.0 Å) for k = 1, 2, and 3, respectively. In order to approximately account for the variation of the probability distribution of pair distances with number of residues in the protein (N), a factor ck(N) has been included in Elong. It is given by c1 = 1.0, c2 = 1.0/(0.0065N + 0.87), and c3 = 1.0/(0.042N + 0.13). The individual wells are also weighted by parameters which depend on the identities of the amino acids involved, using the four-letter code defined above. Since we also enforce (Pi, Pj, k) = (Pj, Pi, k) for all i and j, and there are three wells, the number of parameters is 10 × 3 = 30 in addition to the 512 from the associative memory part of the Hamiltonian.
Optimization
The 542 linear parameters, , were optimized by training them on a set of ten proteins, using an optimization criterion explicitly based on energy landscape ideas. While the Hamiltonian above is rather general, it was trained for the specific task of making ab initio predictions for the structures of small to medium-sized alpha-helical proteins for which no structural homologs are known. Therefore, the training proteins were all alpha-helical proteins (ranging in size from 63 to 189 residues), as were the 36 memory proteins. Since our objective was to make predictions where structural homologs are not available, the memory proteins were also chosen to have a low structural (and sequence) similarity to the training proteins. Thus, it is a challenging problem to find the set of parameters that will give good predictions for the training proteins; however, we may have some confidence that having done so, the energy function will give good results for true unknowns outside the training set. The PDB codes of the training proteins and memories are given in Appendix B.
Full details of the optimization procedure have been presented most recently in Recference [32], and similar schemes have been discussed elsewhere [22, 29, 36, 37]; here we simply outline the main points. The basic idea of the approach is to find a that maximizes the funneling in the energy landscape of the training proteins. Specifically, we maximize the stability gap averaged over the training proteins, . The stability gap of a single protein Es( ) is defined as the average energy gap (excluding backbone terms) between its native state and a specified set of collapsed non-native structures (these globules are generated in the first instance by simply taking coordinates from fragments of larger proteins). The maximization of must be subject to at least one constraint. In the simplest version of the scheme [22], this is achieved by fixing the variance of the globule energies averaged over the training set, . While this constraint effectively just sets the energy scale in the optimization procedure, the physical motivation is transparent: For a single protein, Es/ ( E2) is the ratio of the native bias to the roughness of the landscape, i.e., a measure of the degree of funneling. It has also been shown that for a random energy model, maximization of this ratio is essentially equivalent to maximization of the ratio of folding to glass transition temperatures (Tf/Tg) [22].
In the current optimization procedure, the single constraint on the roughness is replaced by three separate constraints on the contribution to the roughness coming from the three proximity classes. The motivation behind this is to prevent the occurrence of a glass transition at high temperature within an individual class. For example, if the roughness in the short-range class were very large, the local in-sequence structure of the protein could become frozen (partially incorrectly), rendering the protein rigid before longer-range interactions have a chance to form. There are also three additional constraints on the contribution to the globule energy coming from the three classes. These give control over the collapse transition temperature of the resulting energy function, and also have the effect of giving a roughly even distribution of energy between proximity classes (in both native and globule states) in accordance with expectation [38].
The optimization procedure is performed iteratively; i.e., after solving for , a new set of globules is generated by molecular dynamics using E( ) as the energy function (this is the most computationally intensive part of the procedure). Reoptimization follows, and the process is repeated until convergence is reached.
After is obtained, structure predictions are made by searching for low-energy states by molecular dynamics. An annealing schedule (temperature as a function of simulation time) must be set. With a simple linear reduction in T from = 2.0 to = 0.0 in 90000 time steps, very good predictions (by current standards) are obtained for proteins outside the training set. This is detailed in the paper by Hardin et al. [32].
3. Sampling methods
Three principal cases have been investigated by the sampling method described in this section. The prediction Hamiltonian presented in the previous section has been analyzed for two different alpha-helical proteins. The first, phase 434 repressor (PDB code 1r69), is a member of the training set, and the second, HDEA (PDB code 1bg8), is neither a member of the training set nor does it have significant structural homology with any of the memory proteins. We emphasize again that the training proteins also do not have significant structural homology with any of the memory proteins. Thus, we do not expect there to be a very significant difference between the quality of the Hamiltonian for small to medium-sized alpha-helical proteins outside and inside the training set. There is of course the possibility of over-learning parameters in the optimization procedure if the training set is too small. While we do not attempt to address this issue in this paper, since clearly many more than two examples would be required, the techniques described below can straightforwardly be used in a systematic investigation. The main point in this paper is to quantitatively contrast the ab initio prediction energy landscapes of proteins 1r69 and 1bg8 with that arising for an ideal prediction scheme based on knowledge of the native structure. For this ideal scheme we use a G model [39] with the native structure of protein 1r69 designed to be its global energy minimum, as described in Appendix C.
One of the key thermodynamic quantities desired from simulation is the free energy as a function of temperature and one or more order parameters, Q. Since a protein is a finite-sized system, this could in principle be obtained from one constant-temperature simulation run (using, for example, the so-called single-histogram technique [40]). However, despite the fact that this approach can be successful in the context of protein folding (see, e.g., [41]), it is found to be inefficient for the off-lattice model studied here because the form of the free energy as a function of nativeness leads to only a very small region of phase space being explored in a reasonable simulation time. The method preferred here then is the multiple-histogram technique [42, 43], which has found widespread application in the protein folding field. As specified more precisely below, many simulations are carried out with different biasing potentials added to the Hamiltonian, each acting to constrain the protein to a chosen region of phase space. This allows efficient sampling of the funnel, while the multiple-histogram technique provides the prescription for extracting free energies (of the bare Hamiltonian) from this data.
The sampling procedure adopted is as follows: Initially, a temperature at which to sample must be chosen. The choice here is = 1, which is guided by a number of considerations. First, this is close to the folding temperature of the G model, which is found to be = 1.06. Second, it lies in the region = 0.81.0, where structures with the highest degree of nativeness are typically found on annealing runs. Finally, at = 1 the protein kinetics are sufficiently facile that the length of time between essentially independent samples is relatively small. At lower temperatures, slowing in the kinetics leads to a rapidly increasing computational burden for the prediction Hamiltonian. At = 1, ns = 20 constant-temperature simulations are then performed using the energy functions
(E is replaced with EG for the G case, of course.) The functions Vi(Q) (i = 1, 2,· · ·
, ns) are well-shaped potentials centered on different values of Q to give a good sampling of phase space along all of the reaction coordinates of interest, with care taken that adjacent simulations have overlap in the regions sampled. The preferred choice was Vi(Q) = Vi(Q) = 105 (Q Qi)4, with Qi = 0, 0.05, 0.1,
· · ·
, 0.95. The order parameter, Q, measures similarity to the native state and involves a sum over all (except nearest-neighbor) pairs of C atoms,
|
Q =
|
2

(N 1)(N 2)
|
|
 i < j 1
| exp
|
|
|
(rij rNij )2

2 2ij
|
,
|
|
(8)
|
where rNij is the C C distance between residues i and j in the native state. Q runs between 0 (completely unfolded) and 1 (native).
During each simulation, a total of Niobs = 400 samples of Q and E are performed at regularly spaced intervals of 3000 time steps. The time interval such that successive observations may be considered independent depends on the Hamiltonian under consideration. At = 1, it is approximately 30000 time steps in the case of 1r69 and 1bg8, and about 6000 time steps for the G model; the dependence of this time on the biasing functions is weak. Thus, the total number of essentially independent samples is 20 × 400/10 = 800 for 1bg8a and 1r69, and 20 × 400/2 = 4000 for the G model. A histogram Ni(E, Q) is created for each simulation, i, whence the multiple-histogram approach gives the following for the density of states, n(E, Q):
|
n(E, Q) =
|
 i
|
wi(E, Q)
|
Ni(E, Q)

Niobs
|
eßiE + ßiVi(Q)Zi(ßi).
|
|
(9)
|
Here ßi = 1/kBTi is the inverse temperature of simulation i, and the weights wi(E, Q) that minimize the error in n(E, Q) may be expressed as
|
wi(E, Q) =
|
Ai2

Aj2 j
|
, Ai2(E, Q) = n(E, Q)(Niobs)1eßiE + ßiVi(Q)Zi(ßi).
|
|
(10)
|
The partition function, Zi, is as usual given by
|
Zi =
|
 E,Q
|
n(E, Q)eßiEßiVi(Q).
|
|
(11)
|
Equation (9) for n(E, Q) and Equation (11) for Zi self-consistently determine n(E, Q) to within a multiplicative constant, and hence the free energy,
|
F(Q, T) = kBT log
|
|
 E,Q
|
n(E, Q)eE/kBT
|
,
|
|
(12)
|
to within an additive constant. Knowledge of the density of states allows straightforward calculation of the canonical energy and entropy, as well as expectation values of various observables, as a function of Q and T. Our experience is that, with proteins of a size similar to those analyzed here, the amount of sampling performed here gives good quantitative results for an extrapolation in temperature of up to roughly 10%. The temperature range may of course be extended by performing additional simulations at different temperatures.
In the case of protein 1r69, we perform an additional 20 simulations, as described above, but with the single difference that the functions {Vi(Q)} are replaced with {Vi(Qt)}, where Qt measures the similarity to a local minimum (trap) of the potential found from an annealing run. Qt is defined as Q [Equation (8)], but with the coordinates of the native state replaced by those of the trapped state.
In the results section, thermodynamic quantities are principally displayed as a function of Q, but other order parameters are investigated as well; the use of one particular order parameter in the biasing functions does not preclude calculation of free energies as a function of another, as the multiple histogram equations above make clear. Other order parameters we consider in the following section include RMSD (the root mean square deviation of the C carbons from their native positions) and a contact Q, Qc. This is defined in a way similar to Q, but with the difference that only C pairs that are closer than a cutoff distance, rc, in the native structure are included in the sum,
|
Qc =
|
i < j 1
(rc r
|
N ij
|
) exp
|
|
|
|
(rij r
|
N ij
| )2
|

2 ij2
|
|
|

i < j 1
(rc rNij)
|
|
(13)
|
We chose rc = 8 Å. Qc is thus more directly comparable to the typical order parameter used in lattice model studies (the fraction of native contacts) than is Q, which includes all C pairs in its sum. It can therefore be useful to examine the landscape using Qc as an order parameter so that a better correspondence with lattice studies can be achieved. Finally, we also define order parameters that describe the amount of order by proximity class. These are also defined analogously to Q, but with the sum restricted to run over only the |i j| appropriate to that class. For example,
|
Qshort =
|
j 5 < i < j 1
|
exp
|
|
|
|
(rij r
|
N ij
| )2
|

2 ij2
|
|
|

j 5 < i < j 1
1
|
|
(14)
|
4. Results
To give a sense of the relationship between the order parameter principally used in the thermodynamic profiles below (Q) and the commonly used RMSD order parameter, we show in Figure 1(a) the mean value of RMSD for protein 1r69 at = 1 obtained with the prediction Hamiltonian, as a function of Q. The points on the figure represent instantaneous values obtained during the course of the simulation, and are included to indicate the size of fluctuations around the mean. Similarly, Figure 1(b) shows the dependence of the mean value of the contact order parameter Qc as a function of Q. Several features apparent in the figure are worth noting. First, Qc and Q are strongly correlated over their entire range. Note that < Qc(Q)> > Q for all Q so that, for example, an ensemble of structures at Q = 0.4 (40% of residue pairs at the correct distance) corresponds roughly to one with Qc = 0.6 (60% of contacts formed). On the other hand, Q and RMSD are strongly correlated for only Q > 0.5 (roughly RMSD < 5 Å). For lower values of similarity to the native state, there is a much wider spread of RMSD found at a single value of Q. For example, at Q = 0.4 the RMSD ranges from 5 Å to 12 Å. This reflects the fact that outside the immediate vicinity of the native state, motion of (say) a protruding arm of the protein can cause little change in the number of contacts while giving rise to large fluctuations in RMSD; this is one reason that Q (or Qc) is often preferred to RMSD as an order parameter until only extremely good results are considered.
Figure 1
A rough conversion between RMSD and Q can be attempted for near-native structures, and Figure 1(a) shows that Q = 0.75 corresponds roughly to a 2-Å RMSD, while Q = 0.54 corresponds to an RMSD of 4 ± 0.5 Å. While these values and the results of Figure 1 have been obtained for the particular case of protein 1r69 at = 1.0 with the prediction Hamiltonian, the situation is found to be insensitive to temperature variation, and very similar quantitatively for protein 1bg8. The case of the G model, which is shown in Figure 2, is also broadly similar, but there are some quantitative differences. Most notably, a significant number of structures for which Q > 0.5 have a large RMSD. This is probably related to the fact that, although in all cases studied here the proteins remain mostly collapsed over the entire range of Q, the fluctuations in the radius of gyration are larger in the G model than the prediction Hamiltonian, as illustrated in Figure 3. This stems from the fact that in the G situation no non-native interactions exist to stabilize the more fully collapsed states, and again points up the fact that when judging the quality of a structure it is desirable to have several measures available, and not rely only on the RMSD parameter.
Figure 2
Figure 3
Before examining the free-energy profiles, it is helpful to simply look at a few examples of the huge number of structures encountered during the molecular dynamics sampling. In Figure 4 we show simulation structures for protein 1r69 with the prediction Hamiltonian at Q = 0.25 (top), Q = 0.4, Q = 0.55, and Q = 1.0 (bottom). At Q = 0.25, the structures are essentially random, with hardly any well-defined secondary structure; they are similar neither to each other nor to the native state. At Q = 0.4, helical secondary-structure elements are clearly visible; the structure to the right has some noticeable topological similarity with the native state. By Q = 0.55, however, the resemblance to the native state is clear. Note that while all other non-native structures were encountered at = 1 (and in the presence of a biasing potential), the Q = 0.4 structure to the left is in fact a local minimum of the (unbiased) energy function found from an annealing run. This probably accounts for the noticeable difference in helical content from the other two Q = 0.4 structures shown.
Figure 4
Figure 5 shows the free energy as a function of both Q and Qc at two different temperatures ( = 1.06 and 0.9) for the three different Hamiltonians investigated here (G model, protein 1r69 with the prediction Hamiltonian, and protein 1bg8 with the prediction Hamiltonian). The behavior of the G model is seen to be in striking contrast to that of the prediction Hamiltonian. In the G case at = 1.06 [Figure 5(a)], the free energy has a double-well structure, with a small barrier of 3.2 (3.0kBT) separating a disordered molten globule-like minimum at around Q = 0.33 (Qc = 0.41) from a native-like minimum with Q = 0.76 (Qc = 0.79). As the temperature is lowered, the globule rises in free energy relative to the native minimum (and in fact by = 0.9 the globule minimum has essentially been washed away), which in turn moves to still higher Q (Q = 0.87 or Qc = 0.9 at = 0.9) and approaches Q = 1 as the temperature is lowered further. The free-energy minima in this case are controlled by the exchange of energy for entropy. With regard to the prediction Hamiltonian, however, only a single minimum is seen over the same temperature range for proteins 1r69 and 1bg8. Measured by Q, this minimum lies roughly at the same position as the G globule minimum, but measured by Qc it is more native-like. Lowering from 1.06 to 0.9 does lead to a shift in the position of the minima (for both proteins 1r69 and 1bg8) toward the native state (measured by both Q and Qc); however, this shift is modest in comparison to the G case, and a further decrease in temperature is found not to move the minimum significantly nearer the native state. This is an indication that conflicts between different forms of energy, in contrast to a simple tradeoff between energy and entropy, are important; i.e., the prediction Hamiltonian is more frustrated than the G Hamiltonian.
Figure 5
Knowledge of the form of F(Q, T) for proteins in the training set has immediate practical benefits, notwithstanding the insight it may give into improving energy functions (which is discussed further below). In particular, it aids in the design of an appropriate annealing schedule, although for a full picture all proteins in the training set, as well as kinetic issues, should be considered. On the basis of analysis of protein 1r69, it appears that search via MD efforts should be concentrated around = 0.9. Not only does F(Q, T) indicate the expected quality of a prediction at a given temperature, but it also trivially allows an estimate to be made of the expected sampling time required before a structure of a desired quality is found. For example, at = 1.06, F(Q = 0.55) ; 13 ; 12kBT, and so the number of independent samples typically required to realize such a structure will be roughly e12 ; 105 at this temperature. At = 0.9, F(Q = 0.55) ; 9 ; 10kBT, and the number of independent samples required to see a Q = 0.55 structure is reduced to e10 ; 104. For protein 1bg8 at = 1.06, F(Q = 0.55) ; 18kBT, requiring roughly 108 independent samples, a number which in fact increases slightly at the lower temperature of = 0.9, at which F(Q = 0.55) ; 20kBT.
In order to better understand the behavior of the free energy, it is helpful to consider the energetic and entropic components separately. These are illustrated in Figures 6 and 7 respectively as a function of Q (examining the entropy and energy as functions of other order parameters is of course possible and yields essentially the same picture). Consider first the energy. In the case of the G model, the energy decreases monotonically, almost linearly as Q is raised from 0 to 1. Decreasing the temperature does not affect this behavior, for which the energy function provides a thermodynamic driving force for increasing Q whatever the value of Q; i.e., the energy function is funneled to the native state. In the case of protein 1r69, it is found [Figure 6(b)] that at = 1.06, E(Q) is monotonically decreasing; i.e., the energy function is funneled to the native state, although for Q > 0.4 the slope is less pronounced. At = 0.9, however, for Q > 0.4 the energy is essentially flat and is no longer funneled to the native state. The energy landscape in this case resembles a caldera more than a funnel. In the case of protein 1bg8 the situation is similar, but more pronounced: The landscape is funneled to Q ; 0.4, but a further increase in Q is accompanied by an increase in E(Q); lowering the temperature only acts to make this bias away from the native more severe. In this light, the behavior of the free energy is more understandable. For example, since as stated above, E(Q) is funneled down to Q = 0.4 at = 0.9, lowering the temperature below = 0.9 will not lead to a minimum in F(Q) that is more native-like than at Q = 0.4. Since the minimum in F(Q) already lies close to Q = 0.4 [Figure 5(b)], further reduction in temperature is unlikely to significantly improve F(Q) and could even lead to deterioration.
Figure 6
Figure 7
The reason that the energy landscape for the prediction Hamiltonian is caldera-like rather than funnel-like is of course due to the existence of non-native structures that lie lower in energy than the putative native structure. In order to gauge the number of structures that are energetically competitive with the native structure, it is necessary to turn to the entropy curves. It is sobering first to consider the G model. As noted above, this model has a folding transition at = 1.06, and the transition-state ensemble [defined to be the maximum in F(Q)] lies just below Q = 0.5. Thus, defining all states with Q ; 0.5 to lie in the native basin and all other states to lie in the globule minimum, we calculate the entropy difference between the native and globule basins. At Tf this is found to be 122kB [as may be approximately verified by looking at the entropy difference in Figure 7(a) between Q = 0.33 and Q = 0.76]. In other words, the globules are entropically equivalent to e122 ; 1053 native basins. With regard to the size of a native basin, we note that <RMSD(Q = 0.76, = 1.06) > = 2.7 Å. Since 1053 ; 6329, this corresponds to roughly 30 states of 2.7-Å resolution per residue. Even though these numbers of states would be reduced were the G model confined to be more tightly collapsed, they clearly emphasize the entropic mountain faced in structure prediction.
In the case of protein 1r69 with the prediction Hamiltonian, the slope of S(Q) is smaller than for the G Hamiltonian. As temperature is reduced, this slope is reduced still further. This again reflects the presence of non-native interactions; these naturally have more effect at small to intermediate values of Q (at Q = 1, of course, they are absent by definition), where they stabilize a small fraction of states which gain in thermodynamic weight as the temperature is lowered. The situation for protein 1bg8 is seen to be similar to that for protein 1r69, but it is more extreme: At = 0.9, S(Q) becomes nearly flat above Q = 0.3, strongly suggesting that the system is running out of states at this temperature; i.e., it is approaching a glass transition [4447]. To be more quantitative, we pose the question What is the (canonical) entropy difference S* between the two volumes of phase space separated by the surface Q = Q*? Given that we do not know the actual basin size in the prediction model, and that with this Hamiltonian the desired native state may not even lie in a localized basin, the choice of Q* is somewhat arbitrary. We make the choice Q* = 0.55, motivated by the fact that this corresponds to an RMSD of slightly less than 4 Å on average, which would currently be considered a very good ab initio prediction. A quick estimate of S* may be obtained from locating the minimum in F(Q) (Figure 5), then calculating the difference S* = S(Qmin) S(Q*) from Figure 7. These differences are roughly S*( = 1.06) ; 30kB and S*( = 0.9) ; 20kB for protein 1r69, and S*( = 1.06) ; 20kB and S* ( = 0.9) ; 10kB for protein 1bg8. In other words, the number of states thermally occupied in protein 1r69 decreases from 1013 4-Å basins at = 1.06 to 108 4-Å basins at = 0.9. [Note that the difference between the estimated number of thermally occupied states at = 0.9 (108 Q = 0.55 basins) and the above-estimated expected number of states to search through at this temperature before finding one with Q ; 0.55 (104) reflects the fact that E(Q) is not precisely flat at this temperature, but retains a small bias toward the native.] The corresponding drop for protein 1bg8 is roughly 108104 4-Å basins. Unfortunately, accurate calculation of these numbers beyond this temperature range is hampered by statistical noise in the calculation of the energy. However, within the range = 1.10.9, the decrease in entropy is found to be nearly linear, so a thermodynamic glass transition in the region = 0.60.8 appears likely.
Further investigation of the onset of glassy behavior is facilitated by considering order parameters other than those quantifying similarity to the native state. For example, Figure 8 shows the probability distribution P(q) of overlaps q obtained between members of the equilibrium ensemble at two different temperatures for the three Hamiltonians studied. Although, naturally, the G model is the antithesis of glassy behavior, it is nonetheless helpful to consider P(q) for this case first; see Figure 8(a). At the G folding temperature ( = 1.06), P(q) has two distinct peaks: one at q ; 0.31 corresponding to the globule minimum, and one at q ; 0.66 corresponding to the native basin (the structures with strong similarity to a particular structurein this case the nativeare also similar to one another). At the lower temperature = 0.9, only the native-like peak survives, and it moves to higher q, since at this temperature the native basin is more tightly focused on the native structure [see Figure 5(a[i])] and the equilibrium structures are consequently more similar to one another. This situation contrasts sharply with that found for the prediction Hamiltonian. Here, the maximum in P(q) moves only from q = 0.34 to q = 0.41 (1r69) and q = 0.36 to q = 0.44 (1bg8a) between = 1.06 and 0.9. This movement reflects the loss in entropy of the ensemble as the temperature is lowered; however, in this temperature range there is no compelling evidence of the existence of a small number of strongly localized basins (traps); i.e., there is no second peak at large values of q. There could be several reasons for the absence of such a signal. Most obviously, the temperature range investigated could still be above the kinetic glass transition temperature (TA) where distinct traps separated by free-energy barriers form [44, 48]. Alternatively, the system could be below TA, but with the globule basin still thermodynamically dominant because of its larger entropy. Also, problems with sampling can never be entirely ruled out: Free-energy barriers may make some basins kinetically inaccessible on the simulation time scale, or there could simply be so many traps that the probability of return to any one is low on the simulation time scale. In this regard, note that the presence of a large number of dissimilar traps will also reduce the expected magnitude of any second peak in P(q), which is approximately inversely proportional to the number of traps.
Figure 8
It is interesting to probe an individual trap in detail. Figure 9 shows the free energy, energy, and entropy as a function of similarity Qt to a local minimum of the energy function foundas discussed in the previous sectionfrom an annealing run with the prediction Hamiltonian for protein 1r69. Figure 9(a) indicates that in the temperature range studied, there is only a single minimum in F(Qt) (lying close to Qt = 0.4). Thus, the local minimum found from annealing is not the center of a localized basin at = 0.9. However, it is clear that at some lower temperature this situation must change, since the state is a local minimum of the energy function, but not a unique one. There are some indications from Figure 9 that at a temperature not far below = 0.9, F(Qt) will develop a double-minimum structure. In particular, in comparison with F(Q) [Figure 5(b[i])], F(Qt) is significantly flatter in the region 0.6 < Qt < 0.8, and becomes flatter with decreasing temperature. Furthermore, E(Qt) [Figure 9(b)] remains funneled at = 0.9, indicating that below this temperature the trend in F is continued (i.e., high-Qt states are reduced in free energy relative to those at low Qt); and that a second minimum may appear at high Qt. The trend in F suggests that this occurs at roughly = 0.70.8. Indeed, direct calculation of (Qt) at = 0.7, although an extrapolation of 30% in temperature and outside the range for which the multiple histogram method is quantitatively reliable, does indeed suggest a double-minimum structure for F(Qt), with a second minimum occurring in the range Qt = 0.70.8. Note that this implies a basin size somewhat smaller than the Q = 0.55 used to estimate the thermodynamic glass transition temperature above; if it is typical, that estimated temperature should be revised downward a little. Of course, all traps will to some extent have different F(Qt), and analysis of a single one does not necessarily give the full picture. However, the main point of our analysis here is to illustrate techniques that may be used to investigate trapped states, rather than to provide an exhaustive study of a single Hamiltonian.
Figure 9
We now consider how to identify where in the energy function improvements might be made. In this regard, it is often useful to consider the contribution of different components of the energy to the slope of the funnel. A natural first step is to consider separately the associative memory and backbone contributions. In Figure 10 the backbone energy is shown as a function of Q for two different temperatures for the three Hamiltonians considered here, and in addition it is shown as a function of similarity to the trapped state just discussed. For the case of protein 1r69 [Figure 10(b)], the backbone provides a significant contribution to the funneling, roughly 35 between Q = 0.3 and Q = 0.8 at = 1.06, and 25 at = 0.9. Unsurprisingly, this is similar to the amount of backbone funneling provided in the G case [Figure 10(a)]. The slope in the case of protein 1bg8a is less marked, and almost disappears at = 0.9. Further analysis (not illustrated) shows that the major contribution to the variation of Eback with Q is from the Ramachandran potential. Figure 10 thus immediately implies that an increase in the weight of the Ramachandran potential will lead to a shift in the minimum of F(Q) toward the native state for protein 1r69. Increasing the weight of this term will lead to a depopulation of those levels with unfavorable angles, which lie at intermediate values of Q; i.e., the entropy of the globule basin will be reduced. There is naturally a limit beyond which further increase of the Ramachandran weight will not improve prediction results, as would be indicated by a flattening out of a plot of Ramachandran energy against Q (much as seen for protein 1bg8 at = 0.9). Further increase of the weight of an unbiased Ramachandran energy in the Hamiltonian will lead to emphasized glassy behavior with basins centered on relatively few states with near-perfect angles. [It is worth noting that the trapped state found with the current Hamiltonian already has a slightly more funneled Eback(Q) plot than the native; see Figure 10(d).] Therefore, the weight of the backbone should be changed cautiously, after considering all of the proteins in the training set as well as any kinetic consequences of such a change. The use of secondary-structure predictions to produce a Ramachandran potential biased to predicted secondary structure could also help.
Figure 10
Figure 11 shows the combined contribution of associative memory and contact (i.e., non-backbone) terms to the energy as a function of Q. The behavior of Eamc(Q) largely mirrors that of E(Q) shown in Figure 6. In particular, Eamc(Q) is funneled to the native state in the G case, and also, though less strongly, for protein 1r69 at = 1.06, while at = 0.9, Eamc(Q) is flat beyond about Q = 0.4 for protein 1r69, and slopes away from the native state for protein 1bg8a in the temperature range shown. Thus, at = 0.9, the weak residual funneling beyond Q = 0.4, apparent in E(Q) for protein 1r69 [Figure 6(b)], arises entirely from the backbone contribution (Figure 10) as opposed to Eamc. Note also that the difference between Eamc as a function of similarity to the native and Eamc as a function of similarity to the trapped state is a late funneling worth about 20 between Qt = 0.5 and 0.9 [compare Figures 11(b) and 11(d)]. This is much greater than the corresponding difference arising from backbone contributions, indicating that the trapped state is largely stabilized by associative memory or contact contributions.
Figure 11
It would naturally be desirable to improve the quality of the non-backbone terms. To investigate them in more detail, we break up their contributions according to sequence separation. Specifically, Eamc is separated into three terms {Ex} for which x = short, medium, or long, corresponding to the three proximity classes of the Hamiltonian. These are shown as a function of Q in Figure 12. The individual components of Eamc are found to follow trends similar to those of Eamc (or even the total energy E). For example, all components {Ex} contribute similarly to the funneling in the G model, which is funneled down to the native state [Figure 12(a)], while in the prediction case the funneling is weaker or absent after a certain value of Q [Figures 12(b, c)]. However, it is more profitable to focus on the difference between the different classes of interaction. For example, Figure 12(b) shows that the contribution to the funneling for protein 1r69 above Q = 0.4 comes almost entirely from the short-range interactions at = 1.06; at = 0.9, there remains weak funneling in this class of interactions, while the other two classes give a small funneling away from the native state [leading to a flat total Eamc(Q > 0.4); see Figure 11(b)]. This situation is essentially repeated for protein 1bg8a, with the contribution to funneling being largest for Eshort, followed by that for Emedium. This implies that a reweighting of the overall contributions of short-, medium-, and long-range interactions to increase the weight of short-range ones will move the minimum in F(Q) toward the native state (at least in the temperature range of Figure 12). The balance among short-, medium-, and long-range interactions is optimal when all three curves level off at the same temperature.
Figure 12
While Figure 12 aids assessment of the balance among the various terms, it is difficult to use it alone to assess the relative merits of the energy function used in the three classes. The additional information required is a measure of the amount of ordering by class, i.e., a knowledge of the thermodynamics as a function of {Qx}. For example, the main difference between {Ex} as a function of Q and as a function of Qt is an additional funneling that occurs near the trapped state in the short- and medium-range classes of interactions. This does not imply that there is a flaw in these particular classes, especially given that the trap has Qshort = 0.8, Qmedium = 0.56, but Qlong = 0.29. A more plausible explanation is that improvement in the long-range potential would help discriminate more against such traps. Another example for which {Qx} information is required is shown in Figure 12(b[i]). Here, the caldera-like nature of the long- and medium-range classes compared to the funneled short-range energy could in principle be explained by a scenario in which the medium-to-long-range potential is so good that Qmedium and Qlong are close to unity for all structures encountered in the simulation. That this is not in fact the case can be seen in Figure 13 [part (b[i]) in particular], in which the total free energy of the proteins is shown as a function of Qx. The minimum of F(Qx) for protein 1r69 at = 1.06 lies at 0.62 (short), 0.44 (medium), and 0.27 (long). Thus, not only is the contribution of the short-range potential to the overall funnel largest in this case, but the short-range native order is at the same time greater than that in the other two classes. When the temperature is reduced to = 0.9, the minima in F(Qshort) move further toward the native state than in the other two classes. Protein 1bg8a shows behavior similar to that of 1r69 (and there is also no significant difference using similarity to the trapped state by class as order parameters). The short-range native order for protein 1r69 is comparable to that of G [compare Figures 13(b[ii]) and (a[ii])], while the native order in the medium-range class is considerably less than that in the G folded minimum, but is more than in the G globules. The order in the long-range class is similar to that in the G globules.
Figure 13
The above suggests that improvements should be directed at the long-range (contact) potential, which does not appear to provide sufficient discrimination. This is also indicated by Figure 14, in which the components of the energy, Ex, are plotted as functions of their corresponding Qx values. This method of displaying the data is probably the clearest way of comparing the quality of different terms, although, as discussed above, that used in Figure 12 makes possible a more ready assessment of the balance among terms. It is immediately apparent from Figure 14(b) that the funneling in the short-range class is excellent (to Qshort = 1), while in the medium range it is good (to Qmedium = 0.7 at = 1.06 and to Qmedium = 0.55 at = 0.9); however, in the long-range class it is poor (weakly to Qlong = 0.4 at = 1.06, and not beyond Qlong = 0.2 at = 0.9). The situation with regard to the long-range class is somewhat better for protein 1bg8, but the discrimination in this class is clearly worse than in the other two classes. This is not very surprising, since there exist good local signals for secondary structure, so design of good potentials is easier in the short-range (local) as opposed to long-range classes. The discrimination achieved in the medium-range class is probably responsible for the relative success of the prediction Hamiltonian.
Figure 14
Although we have concentrated on describing how changes of the potential affect free-energy profiles and funnel characteristics, free-energy profiling can be used to investigate other aspects of prediction schemes. We end with an example using a consensus sequence scheme. One might argue that fluctuations in sequence away from a consensus are tolerated and are either evolutionally neutral or represent adaptations for function. Thus, a consensus sequence might have a more pronounced funnel. To study this we prepared a consensus sequence for the training protein 1r69. This was done by identifying the 14 most similar sequences via a BLAST search [49], and then identifying the most common class of residue at each position in the sequence after performing a multiple-sequence alignment. This is a simplified version of the multiple-sequence averaging scheme of Keasar et al. [25, 26]. In our four-letter code (H = hydrophobic, P = hydrophilic, A = acidic, B = basic), the actual protein 1r69 sequence and the consensus sequence are, respectively, PHPPBHBPBBHAHPHAAPAHPABHPPPAAPHAAHAAPBPBBPBHHPAHPPPHPHP- HAHHHAPP and PHPABHBABBPAHPHPAPAHPABHPHPAAPHAAHAAPBPBPPBH- HHAHPPPHPHPPAHHHAPP. The eight positions where the sequences differ are shown in bold. The free energy and the energy of the consensus sequence were calculated in the key range Q = 0.2 to Q = 0.6 and are compared to the results for protein 1r69 in Figures 15(a) and 15(b), respectively. There is a clear but modest shift in the position of the free-energy minimum from Q = 0.36 to Q = 0.39. This is accompanied by a greater funneling in the medium-Q range (Q = 0.30.5). There is also evidence that the energy function in the long-range proximity class, in particular, has better discrimination when the consensus sequence is used; see Figure 15(c). This is despite the fact that it was the protein 1r69 sequence, and not the consensus sequence, that was used in the optimization procedure. Thus, while the caldera picture still applies and the resolution of predictions is limited, using the consensus sequence is seen to have a small beneficial effect in smoothing out the energy landscape. More sophisticated ways of using multiple sequence information such as that described by Keasar et al. may have a greater effect.
Figure 15
5. Concluding remarks
The protein structure prediction problem is essentially one of finding an energy function with a funnel-shaped landscape for naturally occurring protein sequences. Additionally, to make the problem computationally feasible it is naturally desirable that the energy function be as simple as possible; i.e., that a reduced description of the protein be used, omitting explicit solvent molecules and possibly some side-chain atoms, and containing few expensive many-body interactions. It is not obvious that such a reduced description can in fact produce a funnel-shaped landscape in the absence of homology information, even for a limited subset of proteins (such as alpha-helical proteins). However, while current ab initio prediction schemes are far from perfect, the best are certainly better than random, indicating that their landscapes are funneled to some extent.
In this paper, we have shown, using well-known methods, how the shape of the energy landscape may be quantified for a structure-prediction Hamiltonian. While the method was illustrated for an associative-memory Hamiltonian with no homology information, it is generally applicable to other energy functions. There are two main reasons to analyze an energy function as described here. First is the wealth of information obtainable from such an analysis, allowing the quality of the energy function to be objectively quantified according to a number of measures. For example, knowledge of F(Q, T) trivially allows estimation of the amount of sampling time required to find a structure of a specified quality; by measuring the free energy as a function of similarity to trapped states, estimates of the kinetic glass transition temperature and basin size can be made. Knowledge of S(Q, T) gives the total number of thermally occupied states as a function of temperature, and leads to an estimate of the thermodynamic glass transition temperature. Perhaps most significantly, E(Q, T) gives a direct and quantitative measure of the shape of the landscape. For an ideal energy function, E(Q, T) is funneled down to the native state at all temperatures, and so the position of the global free-energy minimum is controlled by the exchange of energy for entropy: At low enough temperature, a good prediction will be obtained. For the prediction energy function, E(Q, T) is funneled to the native state at high temperature (where folding is entropically prevented), but when the temperature is reduced, the slope in the funnel decreases until a situation is reached in which the energy landscape resembles a caldera; i.e., it is only funneled up to a certain value of nativeness, beyond which it is flat. This reflects the presence of competing (non-native) interactions; i.e., folding is energetically prevented by the presence of low-energy non-native structures (traps).
The information from thermodynamic analysis appears to come at a large computational cost: For the example studied here, the analysis requires 30 times the computing time of a single prediction. This cost may be reduced by focusing on a smaller region of phase space, dictated in part by expectations for the energy function; if, for example, a prediction of nativeness of Q = 0.55 is required, there is much to be learned just from studying the thermodynamics in the range Q = 0.3 to Q = 0.55. Additionally, the comparison of computing time with that of a single prediction is not a fair one, since for a partially funneled landscape the quality of predictions will vary between individual annealing runs, and several are required to gauge the quality of the energy function. While the thermodynamic analysis can never completely replace judging the quality of the predictions as a means of evaluation (the predictions are, after all, the final product), there is a second key reason to perform it. That is, analysis of the thermodynamics as a function of various order parameters points the way to where improvements can and should be made to the prediction scheme. On a straightforward level, it can guide the design of an annealing schedule. Examining components of the energy as a function of overall nativeness allows the balance between various terms to be assessed and gives an indication how they should be reweighted. Examining components of the energy as a function of the corresponding nativeness [e.g., Eshort(Qshort)], gives a better indication of which components individually require improvement. Once the optimum parameterization has been obtained for a given energy function, further improvement requires refining the form of the energy function. Thermodynamic knowledge can in principle be used to help design two-stage prediction schemes in which a reduced description is used to quickly narrow the search problem to a relatively small basin, after which more detail (e.g., extra atoms or many-body interactions) can be switched on to narrow the ensemble of predicted structures still further.
Although we have not discussed it here, the use of sampled structures with a range of nativeness, combined with a knowledge of the E(Q) and F(Q) in that range, also opens up the possibility of alternative optimization criteria. A simple example is to maximize the stability gap between molten globule structures and an ensemble of structures constrained to lie within, say, 4 Å RMSD of the native state (rather than the native state itself), with the objective of increasing the funneled character of the landscape in the region where it is most needed. Such schemes could potentially be enhanced if combined with the histogram analysis, which allows calculation of energy and free energy as a function of the parameters in the energy function without additional molecular dynamics simulations. We hope to return to these issues in the near future.
Appendix A: Backbone Hamiltonian
The same backbone Hamiltonian as used here, with minor differences, has been discussed more expansively elsewhere; see for example [35]. For completeness, the main points of the backbone energy function used in this paper are summarized below. Eback is composed of several terms:
|
Eback=ESHAKE + Eev + Echain + Echi + Erama.
| (A1) |
Each term depends only on the positions of C , Cß, and backbone O atoms (all assumed to have unit mass), which are thus the only atoms to enter the dynamics. It aids visualization, however, to express some of the forces between these atoms in terms of the variables
rNi = 0.483rC i1 + 0.703rCi 0.186rOi1
| (A2)
|
and
rCi = 0.444rCi + 0.235rC i+1 0.321rOi,
| (A3)
|
where, in an obvious notation, rNi and rCi are the positions the nitrogen and C' carbons of the protein backbone would respectively assume, given ideal amino acid geometry.
The chain connectivity is maintained by the SHAKE algorithm [50], which constrains the neighboring C C distance, as well as the C Cß bond and distances from the oxygens to the neighboring two C atoms at their ideal values. Excluded volume effects are included via a harmonic interaction between C C , C Cß, CßCß, and OO pairs of atoms separated by less than rev:
Eev =  Cev
|
 x,y
|
 i < j
|
[revC(ji) rCixCjy][revC(j i) rCixCjy]2
|
|
|
+  Oev
|
 i < j
|
(revOrOiOj)(rOev rOiOj)2,
|
|
(A4)
|
where x and y can each take the values and ß. Note that rCev has a sequence dependence; its actual values are given below. The remaining terms act to maintain chain geometry close to that of an ideal peptide chain (with some flexibility). First, the correct bond angles at C are maintained by a combination of the SHAKE algorithm and harmonic potentials of the form
Echain =  chain
|
 i
|
{( NiCiß 2.46)2
+ ( CiCiß 2.51)2
+ ( NiCi 2.45)2}.
|
|
(A5)
|
Second, chirality at the C atoms is maintained using the potential
where
i = ( CiCiß × Ci Ni) Ci Ciß
|
(A7)
|
and 0 = 0.83. Finally, there is a term designed to produce a distribution of dihedral ( , ) angles roughly reflecting that found in real proteins, and commonly displayed in the form of Ramachandran plots,
This term has been somewhat modified from that given in Reference [35] to better fit the backbone torsional angles observed in the standard Ramachandran map for non-glycine residues [51]. Nevertheless, the Ramachandran potential used here is similar to that used previously in that barriers between the minima are deliberately constructed to be lower than in reality in order to promote more facile chain dynamics.
The parameters chosen are CEV = 15.0, OEV = 15.0, chain = 30.0, rama = 1.0, chi = 20.0, rCev(j i < 5) = 3.5 Å, rCev(j i ; 5) = 4.5 Å, and rOev = 3.5 Å.
Finally, we note that while the backbone given above was used in generating structures for the optimization procedure outlined in Section 2, the backbone used in the thermodynamic analysis described in the results section in addition included a term dependent on the radius of gyration of the protein, Rg (calculated from C positions). This is given by a quadratic well centered at Rgpred(N) = 2.2N0.38 (which is an estimate of the radius of gyration of single-domain proteins [52]). Specifically,
Eradius =  radius( g gpred)2,
| (A8)
|
for 0.75 < Rg/Rgpred < 1.5, and Eradius is constant outside this range. We take radius = 10.0. This term has little effect on the thermodynamics for the prediction Hamiltonian, and is not required for collapse in this case. It is merely included to facilitate collapse in the case of the G model, which otherwise does not collapse before folding.
Appendix B: Training proteins and memories
The PDB codes of the ten training proteins are 1r69, 1utg, 3icb, 256b, 4cpv, 1ccr, 2mhr, 1mba, 2fha, and 1rgp. The 36 memory proteins are 1jhg, 256b, 1bgf, 5icb, 1ah7, 2a0b, 1tx4, 1avs, 1c3d, 1a28, 1ak0, 2abk, 1ail, 1lis, 1b4f, 1pbv, 1huw, 1lki, 2lbd, 1vin, 1aa7, 1bja, 1nsg, 1beo, 1au1, 1rcb, 1e2a, 1b10, 1hiw, 1col, 1szt, 1hul, 1a17, 1axd, 1baj, and 1kxu. For training protein 3icb, memory proteins 5icb and 1avs are replaced with 1cf7 and 1aep; for training protein 1rgp, memory protein 1tx4 is replaced with 1cf7. This ensures that Q for training proteins aligned with memories is Q < 0.4 in all cases, and typically Q ; 0.2. Similarly for protein 1bg8, examined in this paper as an unknown protein outside the training set, memories 1a28 and 1a17 are replaced by 1cf7 and 1aep.
Appendix C: G Hamiltonian
The G model used in this paper is defined by
EG = E
|
AM G
| + Eback.
|
|
(A9)
|
The backbone part, Eback, is identical to that used in the prediction Hamiltonian and described in Appendix A. The other term is an associative memory term with its minimum at the native structure of protein 1r69,
|
E
|
AM G
|
=
|
|
|
G [x(|ij|)]
exp
|
|
|
(rijrNij)2
|
|
.
|
|
 |  |
aG
|
2 ij2
|
i j 3
|
|
|
 |
|
 |
|
(A10)
|
The sum over i and j runs over all unique pairs of atoms (C C , C Cß, CßC , CßCß) with sequence separation of at least three residues. The interaction between C (i) and Cß(j) atoms is thus a Gaussian well centered at their native separation rijN. The well widths are given by the same formula as in the prediction Hamiltonian, ij = |i j|0.15 Å. The weights G (x) given to interactions in each proximity class are in the ratio 4.9:1.35:1.0 (short:medium:long). This is chosen so that energy is evenly distributed among the three proximity classes, as was also the case for the prediction Hamiltonian. The dimensionless constant aG is chosen so that Equation (2) is satisfied, which ensures that the energy difference between the PDB structure and a completely unfolded st |