|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.492.0475 | Copyright info |  |
 |
 |
Overview of molecular dynamics techniques and early scientific results from the Blue Gene project
|  |  |
by F. Suits, M. C. Pitman, J. W. Pitera, W. C. Swope, and R. S. Germain |
|
|  |
 |  |  |
|
| |
|
The Blue Gene* project was originally conceived in 1999 with the goal of building a petaflops computer to address the grand challenge problem of protein folding [1]. Since then, the machine design has evolved, as have the scientific objectives, but the goal of gaining insight into protein science and the mechanisms behind protein folding remains. We chose classical molecular dynamics as the method for simulating protein systems and designed a software framework—called Blue Matter—for doing these simulations on highly parallel machines consisting of thousands of nodes [2].
It is important to emphasize that the Blue Gene project involves not only the development of a powerful, highly parallel computer, but the design of software and mathematical approaches [3] that take advantage of this power, and the application of the software to do research in protein science. Since the Blue Gene/L (BG/L) hardware has been available for production only within the last year preceding this writing, this paper includes scientific results obtained on traditional IBM computers and describes some of the techniques for validating the software as it was developed. In short, a great deal of BG/L science was done prior to the arrival of the computer hardware, both in the development of a highly parallel application framework and in molecular simulations performed on conventional hardware. In this paper, we describe the progression of experiments performed with the Blue Matter framework, with an emphasis on the analysis applied to the results. The intent is to provide general insight into the range of studies performed, without requiring a detailed protein science background.
| |
|
Biomolecular systems, which comprise one or more molecules of biological interest surrounded by some amount of solvent (e.g., a protein in water), can be studied by a variety of computational methods. One of these is molecular dynamics, which simulates the movement of all of the particles of a molecular system by iteratively solving Newton's equations of motion. This calculation is based on the instantaneous coordinates of all of the particles of the system to evaluate their energies and forces of interaction. Given the coordinates, velocities, and resulting forces, one can compute the coordinates and velocities that the particles would have a short time later. This process is then repeated many times, yielding the motion of all of the particles over the time of the simulation. The small timesteps are usually of the order of one femtosecond (10−15 s), and a typical simulation might perform ten million of these steps, resulting in a simulation of 10 nanoseconds (10−9 s). The maximum size of the small timesteps is limited by the fastest motions in the system, which, for biomolecules, corresponds to the vibrations between chemically bonded atoms.
Molecular dynamics allows one to monitor the simulated system as it moves from one conformational state to the next and to deduce the timescales for those transitions compared with experimental data [4]. In addition, a number of variations on molecular dynamics allow one to also simulate the behavior of a molecular system under different conditions of temperature, pressure, and other parameters [5]. These variations allow the simulation of biomolecules in environments that correspond to typical experimental conditions. Under these conditions, properties such as the relative populations of two conformations can be determined from the simulation and compared directly with experimental data. A drawback of these variations is that they alter the temporal evolution of the simulated system, so they cannot be used to directly compare time-dependent phenomena between simulations and experiments. Two important forms of molecular dynamic simulations are thermodynamic and kinetic. Both techniques are important and provide complementary insight into protein structure and dynamics, and both are applied in the Blue Gene science program.
Thermodynamic questions involve the structure and stability of conformational states of proteins. To carry out their biological function, most proteins must be in a specific folded state. This state is generally their “native state,” the most thermodynamically stable conformation for the protein under biologically relevant conditions of temperature, pressure, and pH. As a result, thermodynamic studies of protein folding attempt to answer questions such as the following:
-
What is the most stable structure for the protein at a given temperature?
-
Is there only one stable structure, or are there multiple ones with comparable stability?
-
How do the populations of these stable states change with temperature?
Experimental techniques, such as X-ray crystallography and nuclear magnetic resonance spectroscopy, provide a partial picture of the folding thermodynamics, which can be complemented by detailed simulations [6]. By understanding what governs the stable states of proteins, we will have the potential to design proteins of novel structure [7] and function [8].
Kinetic questions are concerned with time-dependent phenomena: specifically, the questions of how proteins fold so quickly, how much time they spend in intermediate states along the folding pathway, and what interactions govern the overall rate of folding. The first question arises from the Levinthal paradox [9]—there are an astronomical number of possible conformations for a flexible protein chain [a chain of m residues could have O(10m) distinct conformations]—but only a few correspond to the folded state. If the folding process involved some sort of random motion through this space of conformations until the folded state was “found,” proteins would fold much more slowly than observed. Therefore, specific chemical interactions, along with topological constraints, must guide the protein to fold faster than it could by a simple random search. As it folds, the time each protein spends in intermediate states is also important. Intermediate states may be partially folded or misfolded and hence prone to degradation [10] or aggregation [11], both of which can affect the ability of the protein to function, or which may themselves trigger disease [12]. Finally, a full understanding of kinetic phenomena in protein folding must enable us to alter those phenomena in a deliberate fashion. By determining which specific interactions control the overall folding rate, we can design proteins that fold more rapidly or more slowly by making specific mutations. Rapidly folding protein variants could have therapeutic uses [13] or serve as functional or structural nanomaterials.
| |
|
The first step in analyzing results from a simulation is the retrieval of data from the running program. The traditional view of a computer simulation writing values to a file becomes more complex when that computer contains thousands of independent processors working together on the same problem. To provide a more scalable solution that minimizes the need for synchronized, cooperative communication among the nodes, Blue Matter reports simulation results from individual nodes via binary packets sent over sockets. This allows individual nodes to report only their contribution to simulation observables, such as total energy, and relies on external analysis routines to organize and integrate the packets to determine the full simulation state over time. Although there are many observables to study and many modes of analysis, they all begin with the examination of a raw datagram stream containing a sequence of packets representing partial quantities from the simulation. Note that the frequency of the output for various quantities can be set so that it is appropriate to the need. For example, energy terms are usually output much more frequently than the full positions and velocities of the atoms, which represent a much larger volume of data.
Numerous forms of analysis can be performed on the output of a molecular simulation, but many of them represent a reduction of a large amount of information to a simpler form that is easier to interpret. Sometimes this process involves a visualization component based on two-dimensional (2D) or 3D representation; at other times, it is a strictly quantitative reduction of a value that can be matched to experiment. One of the simplest 3D representations of a system is a direct visualization of the configuration, which can be useful both as a check that the system is behaving properly and to provide insight into dynamic processes occurring either within or between molecules. There are many examples of familiar single-value reductions of a molecular system, such as total energy, temperature, and pressure, but there are additional quantities that are specific to a system, such as the number of native hydrogen bonds. Each of these reduced forms can then be studied over time, allowing additional analysis on the time series to be performed, such as autocorrelations. Specific examples of these reduction and visualization techniques are described in more detail in the sections that follow.
With the advent of very-large-scale cellular computer systems such as BG/L, there is the prospect of producing massive volumes of data. Molecular dynamics simulations on a 512-node BG/L partition currently generate approximately 6 GB/day of data (2.2 TB/year) when taking a snapshot of the simulation state every picosecond of simulation time. On a 64-rack system, the extrapolated volume of data might be 300 TB/year, assuming linear scaling of the data volume because multiple simulations could be running on the 64-rack system. There are also significant volumes of data generated by analysis, and there may be simulations that require more frequent sampling of the simulation data. Much of the data generated is “reference” data that is accessed for analysis and then put down without any further access. Although access to the raw data is unlikely to be needed, there is an understandable reluctance to actually discard it. Instead, we have adopted a hierarchical storage approach using Tivoli* Space Manager that migrates data to tape according to user-defined policies.
| |
|
Analysis is useful not only for scientific insight into the results of a simulation, but to help establish its validity and scientific correctness [14]. One example of this is the demonstration that the integrator, which determines the motion of the atoms on the basis of applied forces, is faithfully representing the forces and energies involved. This appears most directly in the conservation of energy in the system, but also in more subtle ways that can be recognized by running a simulation at different timestep sizes.
A molecular system under simulation usually has a conserved quantity, related to its total energy, that should not change over time.1 Energy may change from potential to kinetic, allowing the system to undergo significant changes in its configuration and behavior, but the conserved quantity should remain constant. In actual simulations, the conserved quantity changes slightly on each timestep because of the finite timestep of the calculation, but there should be little or no drift over time. Thus, one of the first analysis tests for the validity of a simulation is to confirm that the conserved quantity is not drifting excessively. An additional test of the validity of the integrator involves running multiple short runs of the same simulation over the same time interval, but with a series of shorter timesteps. For short simulations with small timesteps, the change in energy over time should increase quadratically with timestep in a manner that is very sensitive to integration errors [15].
Figure 1(a) shows a plot of the conserved quantity during a short molecular simulation, along with its constituent energy terms. On this scale, the conserved quantity appears constant, and fluctuations of the individual terms appear to cancel out to keep the sum constant. Figure 1(b) shows a closeup of a series of conserved quantity plots for the same simulation, but at different timesteps, showing that the conserved quantity does vary slightly, and that the fluctuations increase with increasing timestep. Note that the vertical scale is greatly magnified to reveal fluctuations that are small on the scale of the actual conserved quantity. In this view, the curves appear roughly the same shape, but the quadratically increasing scale is not evident. Figure 1(c) shows the same plots, but with the vertical scale decreasing quadratically with the timestep size. This view allows a direct comparison of the behavior of the simulations over time, and the quadratic relationship is evident. There is a slight departure of the curves at later time due to the slightly different trajectory taken by the simulations, but this is a small effect on short simulations.
Figure 1
A more complex simulation involving constant temperature and pressure appears in Figure 2(a), which shows abrupt changes in conserved quantity that appear to scale with timestep, although quadratic behavior is not directly evident.2 When each plot is scaled quadratically with timestep, as shown in Figure 2(b), the departure is immediately evident and linked to the code related to temperature control. Although this test does not guarantee that the simulation is working correctly, it is a sensitive test of subtle errors that would otherwise be hard to find, and it serves as a necessary, though not sufficient, sign of a correctly implemented algorithm.
Figure 2
The end result of these and many other validation procedures appears in Figure 2(c), which is a plot of energy over time for a 20K-atom simulation of lipid molecules. The ability to do extended runs of large systems without significant drift of the conserved quantity provides confidence in the simulation results and allows long runs with constant energy to yield results that provide useful kinetic information, as described in the next section.
| |
|
As described in the Introduction, two common but very different ways to study a protein are via thermodynamics and kinetics. Established methods exist to generate ensembles of proteins at a given temperature on the basis of molecular dynamics [16], and the resulting distribution of conformations represents the “canonical ensemble” at that temperature. Kinetic information tends to be more elusive, largely because the timescales of protein processes are so much longer (microseconds) compared with the timestep of the simulation (femtoseconds)—a factor of one billion.
As our first experiment with Blue Matter molecular dynamics, we chose a novel approach to determine kinetic information about a small protein by running many short, independent simulations in parallel in order to piece together the behavior of a typical protein over a long period of time. The technique involved starting a number of independent runs (237) of a protein from a wide range of initial configurations and tracking their changing shape over time. Different configurations were divided into bins, and the time spent in each bin, along with the probability of transitioning to other bins, allowed a Markov model of the folding process that, in theory, could predict the folding rate and the equilibrium (canonical) ensemble.
The protein in the study was the beta-hairpin from the C-terminus of protein G, which is a 16-residue peptide commonly used in protein simulations because of its rich folding behavior despite its small size [17]. An early step in analyzing a protein simulation is to determine observables that characterize the state of the protein with regard to its degree of foldedness. Two common observables are the number of hydrogen bonds formed and the radius of gyration, but many others can be defined [18]. Hydrogen bonds are part of what gives a protein its shape, and the folded configuration consists of a number of native bonds. An arbitrary protein configuration may consist of some native bonds and other non-native bonds that may be obstructing or assisting the path to the folded configuration. The radius of gyration captures the extent of the protein, with a small value indicating a compact structure. For the hairpin, the number of native bonds and the radius of gyration together define a 2D space into which a given protein configuration can be plotted.
Figure 3(a) is a rendering of a folded configuration of the beta-hairpin created with the Prototype Protein Viewer [19]—the visualization component of the Blue Matter framework. This is a standard “stick” view of the protein, but with a novel representation of the hydrogen bonds as striped tubes with thickness proportional to their energies. This is a relatively direct view of the molecule layout and provides precise location of the individual atoms, but it does not convey the underlying forces responsible for the shape, except for the hydrogen bonds. In contrast, Figure 3(b) shows the ribbon shape of the peptide pioneered by Richardson [20], along with novel representations of other structural components, such as salt bridges and sidechain–sidechain contacts. This is a reduced, abstract view of an otherwise opaque arrangement of atoms; it conveys in one image many of the forces dictating the shape of the protein and their spatial relationship. These representations were very helpful in interpreting the many protein simulations performed for this kinetics study.
Figure 3
If one selects thousands of hairpin configurations randomly from an ensemble at a given temperature, a distribution appears as shown in Figure 4(a). When this scatterplot is instead shown as a log histogram, it represents the “free-energy surface” at the given temperature and depicts the relative distribution of configurations in the canonical ensemble. Figure 4(b) depicts the resulting histogram as a surface in three dimensions, with height proportional to the log of the probability density. In each figure, the strong vertical bands are due to the successive numbers of native hydrogen bonds, and the L-shaped appearance captures extended, nonbonded configurations in the upper left and collapsed, fully folded configurations in the lower right. Note that the distribution of configurations indicates that the hairpins do not simply fold and remain folded, but instead are constantly changing from one configuration to another. However, this thermodynamic picture tells us nothing about how a given hairpin moves about this landscape. That perspective requires a kinetics study.
Figure 4
For the kinetics experiment, we selected 237 starting configurations distributed around the free-energy landscape and launched independent simulations using Blue Matter on an IBM SP* computer. Each simulation involved 256 protein atoms, 1,660 water molecules, and three counter-ions to balance the charge in the system.3 (It is not uncommon for protein–water simulations to be dominated by the computational costs of the water molecules). As each of the protein systems evolved in time, its trajectory on the free-energy landscape captured the folding and unfolding process of an individual protein. Figure 5 is a unique visualization that combines the kinetic data from the independent trajectories with the thermodynamic data from the canonical ensemble. By running all systems in parallel and starting them distributed around the free-energy surface, we were able to track the behavior of individual proteins in different sections of the landscape at the same time. This embarrassingly parallel technique was well suited to an IBM SP computer rather than a tightly coupled shared memory machine, since the different simulations were independent and did not communicate with one another while they were running. This is in contrast to other simulations that require a large number of atoms and many processors in parallel working on a single molecular system.
Figure 5
After running each of the trajectories for approximately one nanosecond, most of the free-energy surface had been visited by at least one trajectory, and together they provided good statistical coverage of the entire region. By dividing the area into bins and statistically determining the mean time in each bin, along with the likelihood of transitioning into the other bins, a transition matrix was defined for the process, which was then modeled using Markov methods. The full theory behind this technique is described in [21], but Figure 6 is a visualization of the transition matrix as a pattern of directed arcs, with probability flux depicted as thickness along each tube. The combination of the static thermodynamic information in the free-energy landscape with a reduced but easily interpreted view of the path of a protein along the surface is abstractly visualized as hops between bins with corresponding branching probabilities. The full results of the simulation are described in [22], which presents the first publication of simulation work done with Blue Matter as part of the Blue Gene project. Although the behavior of the hairpin turned out to be too complex to be Markovian in our simple binning scheme, the technique shows promise for modeling protein transitions on the basis of data from multiple independent runs. We did not perform the simulation on Blue Gene hardware (since it did not yet exist), but the experiment was consistent with the research goals of the Blue Gene project and exercised Blue Matter along with its analysis framework.
Figure 6
| |
|
The next system we studied was a lipid bilayer consisting of 20K atoms, corresponding to a substantial increase in system size from the hairpin. Lipid membranes are biologically important because they form the cell wall and control the transfer of material into and out of the cell. They are also important as the location where many proteins perform their function, as described in the final section.
Figure 7 shows a cross section of the lipid–cholesterol simulation, with two layers of lipid forming the bilayer, a layer of water on the outside, and cholesterols interspersed among the lipids as they would be in a real membrane.4 A key role of analysis for this system is to characterize the way in which cholesterols interact with the lipids. One way to do this is with histograms of the atoms in the vicinity of the cholesterol, which can then be sliced and shown as the isolines in Figure 8(a). Three-dimensional visualization helps here by avoiding the need to slice the volume and instead showing a three-dimensional isosurface of the histogram [Figure 8(b)]. This immediately conveys the density of neighboring atoms around the cholesterol and reveals preferential associations at an atomistic level that can be accessed only via simulations such as these. Additional details of membrane simulation and analysis are discussed in the next section, which describes a lipid–cholesterol membrane and its interaction with an embedded membrane protein, rhodopsin.
Figure 7
Figure 8
| |
|
Membrane proteins are the focus of more than half of the contemporary drug targets [24]. It is difficult to obtain structural detail for membrane proteins at atomic resolution because of the challenges of growing crystals of proteins in a membrane environment [25, 26]. In addition to structure, the functional dynamics are accessible via spectroscopic techniques [27–29]. The added complexity of the membrane environment must be included in membrane protein studies, since the composition of the membrane can have dramatic effects on the function of a protein [30–32].
Simulation of membrane protein environments has received a direct benefit from recent advances in supercomputing. The ability to produce tens to hundreds of nanoseconds of a fully atomistic simulation of membrane environments has produced detailed characterization of their structure and dynamics to a degree previously unavailable and at a scale inaccessible by experiment. As a result, simulation can offer additional insight into the structure, dynamics, and environment of the membrane proteins themselves [33].
Rhodopsin is a member of the signaling protein family known as G-protein-coupled receptors (GPCRs) [34–37]. It functions as the first step in the signal cascade that results in the perception of light [38], but has the distinction of being the only GPCR whose structure is known with atomic resolution [25]. Although other workers have used molecular dynamics to study rhodopsin in a membrane environment [33], we have chosen to apply the capabilities of BG/L to simulate rhodopsin in a more complex, native-like environment of polyunsaturated lipids with cholesterol.
Setting up a large-scale simulation is technically demanding and requires a lengthy equilibration protocol. This is complicated by the fact that it is difficult to determine the point in time at which equilibrium has been reached. For a system as complex as rhodopsin in a two-component lipid matrix with cholesterol, the equilibration time alone can extend to tens of nanoseconds. Until recently, this would have represented the full time allotted to a complete simulation, because of the limited computing resource available.
Figure 9(a) shows a top view of the rhodopsin/lipid/cholesterol system, looking down on the membrane surface. The rhodopsin appears as a red ribbon; the green and blue respectively represent the SDPE and SDPC5, and the cholesterol atoms are shown as gray spheres. This snapshot is from early in the simulation, before the cholesterol distribution has had time to fully equilibrate. Experimental diffusion rates for cholesterol are of the order of 1 × 10−7 cm2/s [39], which means that many nanoseconds of simulation time are required for the cholesterol distribution to reach a steady state.
Figure 9
The changing cholesterol distribution around rhodopsin as it progresses toward equilibrium appears in Figure 9(b) as successive histograms of the radial distribution of cholesterol over time. The counts of cholesterol molecules at a given distance are shown on the vertical scale, where the light color indicates greater likelihood at that distance. Each vertical histogram is an average of the cholesterols over a 250-picosecond time window. The most likely range, which appears as a band of gray across the time axis, is fairly stable, suggesting that the setup protocol produced a configuration near equilibrium. The minimum and maximum ranges of each histogram, however, show a general trend downward. The minimum distances decrease, which indicates that equilibrium includes some cholesterols closely associated with rhodopsin.
Another, simpler, way to characterize the equilibration of the system is to study the mean radius of the cholesterols over time, which amounts to a further reduced representation of the histogram strip chart. Figure 10(a) shows the time evolution of the mean radius of cholesterol from the rhodopsin center of mass. The clear oscillatory pattern indicates fluctuations in the cholesterol distribution on a timescale of several nanoseconds. The amplitude of the oscillations is damping to a converged value, suggesting significant progress toward equilibrium in approximately 25 ns.
Figure 10
Figure 10(b) suggests that in the first 30 ns, part of the rhodopsin structure underwent a rotation of approximately 10 degrees from the initial setup conditions and then stabilized during the 30-ns to 40-ns time frame. Further monitoring is needed to establish stability in the overall rhodopsin orientation, since protein motion occurs on a timescale of tens of nanoseconds. It is not clear how much of this rotation is due to an internal conformational change in rhodopsin, or whether it represents an overall rigid body rotation as it adjusts to its environment; however, the fact that the protein is not making significant angular changes greatly simplifies the analysis of the neighboring lipid–cholesterol environment.
Care must be taken to validate interpretations of the rich detail offered by fully atomistic simulations of complex environments, such as the one described here. Each line of analysis should make contact with what can be experimentally observed. For example, some measurable spectroscopic quantities can be calculated from the simulation and compared with experiment. Once validated, the atomic detail can be examined further for mechanistic insight that is inaccessible experimentally. Simulation, once validated with experiment, can help interpret experimental results by providing a greater level of detail than is available with contemporary techniques.
Supercomputing resources such as BG/L have expanded the level at which complex biological phenomena may be studied. Most significantly, the newly harnessed capability enables investigations that could not have been attempted a few years ago. In this light, the potential advances with BG/L are in kind, rather than in degree. The rhodopsin simulations are ongoing and will be investigated in detail using techniques such as those described above.
| |
|
The Blue Gene/L project has been on a dual path to produce a massively parallel machine and associated scalable software to make use of its power. As more nodes of the machine become available, the scientific possibilities increase tremendously, since the simulations can then access more biologically interesting scales of time and complexity. Much insight has already emerged from Blue Matter simulations on an IBM SP computer and on a BG/L computer of up to 512 nodes, and we look forward to the completion of our work on rhodopsin and the continuation onto more complex systems as the node count expands.
| |
The authors gratefully acknowledge contributions from Scott E. Feller, valuable discussions with Bruce Berne, and the development effort of the Blue Matter team—Blake G. Fitch, Alex Rayshubskiy, T. J. Chris Ward, Yuri Zhestkov, and Maria Eleftheriou.
*Trademark or registered trademark of International Business Machines Corporation.
| |
| |
1The conserved quantity can be simply the total energy, in the case of a constant volume and energy simulation, or, more generally, it is the Hamiltonian, which can include terms related to temperature and pressure control. Some molecular dynamics simulations do not allow for the calculation of such a conserved quantity, but Blue Matter provides a conserved quantity for all of its simulation modes, which allows for important validation tests.
2Here, the conserved quantity includes terms for the energy transferred to the heat bath and the piston that controls pressure.
3These were NVE runs using the OPLSAA force field and SPC water in a 38-Å box. P3ME electrostatics provided fully periodic boundary conditions, and heavy-atom hydrogens were rigidly constrained, allowing one-femtosecond timesteps. (An NVE simulation has a constant number of particles, constant volume, and constant energy. OPLSAA = optimized potential for liquid simulation—all atom. SPC = single point charge. P3ME = particle-particle particle-mesh Ewald.)
4This simulation consisted of 72 SDPC lipid molecules and 24 cholesterols in two leaflets, plus 2,174 TIP3P water molecules, using the CHARMM27 force field; NVE ensemble with a 2-fs timestep and rigid bonds to hydrogen. (CHARMM = chemistry at Harvard molecular mechanics [23]. TIP3P = transferable intermolecular potentials with three point charges. SDPC = 1-stearoyl-2-docosahexaenoyl-sn-glycero-3-phosphocholine.)
5SDPE and SDPC are abbreviations for the chemical makeup of the lipids: SDPE is 1-stearoyl-2-docosahexaenoyl-sn-glycerophosphoethanolamine, and SDPC is 1-stearoyl-2-docosahexaenoyl-sn-glycero-3-phosphocholine.
Received August 20, 2004; accepted for publication September 28, 2004; Published online April 5, 2005.
|
|