|
Molecular dynamics simulations are an important tool in the study and rational design of molecular systems and materials, providing information about the behavior of chemical systems that can be difficult to obtain by other means.1 These properties generally are obtained as statistical mechanical averages of atom trajectories. Unfortunately, these averages usually converge only slowly with the length of the simulation or the size of the molecular system. Considerable computational resources are required, even for small molecular systems (tens of thousands of atoms) and short simulation times (nanoseconds).
Several strategies have been suggested to take advantage of the development of massively parallel computer architectures. Molecular dynamics simulations generate the time evolution of chemical systems, which is an inherently sequential process. The calculation of pair-wise additive atomic forces needed to advance the system a discrete time step, however, does lend itself to parallelization. Parallelization approaches fall in two categories: data decomposition (including replicated data, atom decomposition, and force decomposition) and domain decomposition. Theoretical scaling of memory, calculation, and communication requirements for the different parallelization methods is given in Table 1, where N is the number of atoms in the molecular system, and p is the number of processors used. The scaling of communication in the domain decomposition is valid for systems with large numbers of atoms, a large number of processors, and use of a limited interaction range.
|
|
Table 1 Theoretical parallel scaling of molecular dynamics simulation parallelization strategies
|
| |
| |
| |
| |
| |
|
Memory
|
Computation
|
Communication
|
|
|
Replicated data
|
N
|
N/p
|
N
|
|
Atom decomposition
|
N/p
|
N/p
|
N
|
|
Force decomposition
|
N/ p
|
N/p
|
N/ p
|
|
Domain decomposition
|
N/p
|
N/p
|
N/p |
|
In the replicated data approach, the computational work is distributed over the available p processors, each of which holds a complete copy of the N-particle system. At each step, a processor calculates the interactions for N/p atoms and updates the forces and coordinates, requiring all-to-all communication. Atom decomposition is a variant in which the atom data are distributed, but all-to-all communication is still required. Replicated data and atom decomposition approaches have been used to parallelize existing sequential simulations codes.26 However, the cost of communication makes these approaches less efficient on massively parallel computers. In replicated data and atom decomposition, the calculation of the force matrix is generally row-wise distributed over the processors. Block-wise distribution of the force matrix calculations leads to force decomposition approaches with a more balanced memory and communication requirement.7 In all of these approaches, the list of atoms is directly distributed, and the memory requirements and cost of computation and communication can be precisely controlled for each processor. In this way the complexity of load balancing is significantly reduced.
In the second class of parallelization approaches, the physical space in the molecular system is distributed instead of the atom list.8-13 In this scheme, each processor is responsible for the calculation of forces of the atoms that occupy the region or domain assigned to the processor. This scheme distributes the atoms over the available processors, reducing the memory requirements. For the calculations of local interactions, data are needed only for atoms in neighboring domains, thereby reducing the amount of communication. This approach is especially efficient for large molecular systems simulated on massively parallel computers. Implementations of this approach are much more complex than those based on data decomposition because of the fluctuating number of atoms per processor and the required periodic reassignment of atoms that move from the domain of one to the domain of another processor. A second complication is the need for dynamic load balancing.
The performance of a parallel algorithm on n processors is generally given in terms of the parallel speedup S or the parallel efficiency E, relative to a smaller number of processors m, defined as:
and
where Tm is the wall-clock time to complete the calculation on m processors. This measure of parallel efficiency not only depends on the algorithm but also on the computer architecture. In particular, the cost of communication is an important factor in parallel efficiency. In general, efficient parallel algorithms will minimize the amount of data to be communicated by data replication, data packing, or data recalculation, minimize the number of messages, and avoid simultaneous access to the same resource.
NWChem
The NWChem software suite provides many computational chemistry modules for molecular and periodic systems, ranging from quantum mechanical to classical mechanical methods.14,15 These approaches may be combined to perform quantum molecular dynamics and QM/MM (quantum mechanical/molecular mechanical) simulations. The implemented methods are available on most of the high-performance, massively parallel computer architectures, but also on single, or clusters of, desktop workstations and personal computers running the Linux** operating system.
The computational modules for molecular electronic structure calculations include self-consistent field (SCF) or Hartree-Fock (RHF [restricted Hartree-Fock], UHF [unrestricted Hartree-Fock], high-spin ROHF [restricted open-shell Hartree-Fock]), Gaussian orbital-based density functional theory (DFT) using many local and nonlocal exchange correlation potentials, perturbation theory (MP2, or Moller-Plesset 2nd order energy correction, RI-MP2), complete active space SCF (CASSCF), coupled-cluster singles doubles and noniterative triples correction (CCSD(T)) and selected-CI (configuration interaction). In addition, there are modules for pseudopotential plane-wave electronic structure and periodic Gaussian-based DFT calculations. The classical mechanics module includes molecular dynamics simulation and free energy perturbation and integration.
The chemistry modules in NWChem are built upon a common set of high-performance parallel software tools. The separation of these tools from the chemistry modules provides the flexibility to adapt and tune the code for different computer architectures. A run-time database (RTDB) provides the mechanism with which information is shared between modules. In general, there is a single database per calculation with parallel read/write access by all processors. The memory allocator (MA) provides a standardized tool for the dynamic allocation of local memory, i.e., memory not shared with other processors. The MA library includes routines for management, debugging, verification, usage statistics, and availability of heap and stack memory. The Global Array (GA) toolkit provides a distributed data mechanism for the allocation and access of nonlocal memory.16 One of the most important features the GA tools provide is the one-sided asynchronous access to global memory elements. The GA memory model allows the application programmer to determine the locality of data and manage the memory hierarchy in the parallel application. Other parallel tools include parallel eigensolvers (PeIGS) and three-dimensional parallel Fast Fourier transformations (pFFT).
Molecular dynamics simulation
For molecular dynamics simulations, NWChem includes the five modules as illustrated in Figure 1. The PREPARE module is used to analyze the molecular structure provided as a Protein Data Bank (PDB) file and, using parameter database files for the specified force field, generates the two data files required for the simulation: a topology file containing static information on the system, such as the force field parameters, lists of bonds, angles and torsions, and excluded pairs, and a restart file containing the dynamic information on the system, such as atomic coordinates and velocities.
Figure 1
The NWMD (Northwest Molecular Dynamics) module contains the driver routines for energy minimization, molecular dynamics simulations, thermodynamic perturbation, and thermodynamic integration calculations. This module relies on a module that deals exclusively with the calculation of classical atomic forces and energies (CAFE) and a module called SPACE that handles the data distribution and associated communications for domain decomposition. This separation of tasks will allow for the future implementation of other data decomposition algorithms in modules that can replace SPACE, without the need for modifying other modules.
The ANALYZE module contains routines for the analysis of molecular trajectories generated and currently includes root-mean-square deviation (RMSD) analysis, analysis of internal coordinates (Ramachandran plots),17 and essential dynamics (ED) analysis.18,19
Domain decomposition offers the best theoretical scalability of memory use and communication cost for molecular dynamics simulations of large molecular systems on massively parallel computers. This decomposition is implemented in NWChem as follows. The complete molecular volume is defined as the periodic cell in systems with periodic boundary conditions and the smallest rectangular box that completely encloses the molecular system for nonperiodic systems. By slicing the molecular system in all three dimensions, the simulation volume is divided into rectangular subboxes. Next, these subboxes are assigned to the available processors. Each processor owns a range of subboxes in each of the three dimensions, i.e., the processors are logically arranged into a three-dimensional grid. In this way, each processor owns a subgrid of subboxes. This arrangement preserves locality of the subboxes and allows a unique number to be assigned to each processor, as well as to each subbox from which the location in the processor grid and the subbox grid can be determined without need for communication. Although currently not implemented, the assignment of subboxes to processors can be easily modified to take advantage of the fact that communication between same-node processors is more efficient than communication between off-node processors.
There is no restriction on the size of the subboxes, and the user has complete control over the number of subboxes that are defined. Based on the cutoff radius and the size of the subboxes, each processor generates a pair list of subboxes for which interactions need to be evaluated. Initially, each processor generates this list for subbox pairs in which local subboxes and nonlocal subboxes in half of the directions are involved, i.e., north, but not south, etc. Pairs of subboxes on the same processor are always handled by that processor. In this way all subbox pairs with interactions within the cutoff radius are included in one of the subbox pair lists without double counting.
Note that interactions may also extend beyond subboxes on the immediate neighboring processors. This is basically a distributed implementation of a cell-index list. Since the distribution and size of all subboxes is known on all processors, periodic boundary conditions are easily imposed. Since all data in a subbox are transferred between processors, subboxes should not be made too much larger than the cutoff radius. They can, however, be smaller than the cutoff radius. The current implementation will make the subboxes as large as possible under the restrictions that the size should be less than the cutoff radius times a factor slightly larger than one, and the restriction that each processor should at least own one subbox. This default decomposition can be changed by the user with explicit decomposition input directives.
Figure 2 depicts a two-dimensional representation of the distribution of subboxes. The space a molecular system occupies is divided into subboxes that are distributed over the available processors. This figure illustrates the two main advantages of this distribution: reduced memory requirement, and reduced communication. The processor in red owns a subset of the molecular system and needs to evaluate only interactions with atoms in subboxes within the cutoff radius on half of its neighboring processors, identified by the blue subboxes. In this figure the use of periodic boundary conditions is assumed, resulting in interaction between the red processor with the blue area on the processor in the top row. Note that in this figure each processor owns four subboxes, and that the cutoff radius extends two subboxes.
Figure 2
In NWChem, each processor sequentially processes the subbox pairs in its list, retrieving the atomic coordinates from the remote node, evaluating the contribution to forces and energies, and accumulating force contributions to the appropriate array on the remote processor. The subbox pair list on each processor is ordered such that atomic coordinates from a remote subbox are retrieved only once per molecular dynamics step. Similarly, atomic force contributions for a remote subbox are accumulated to the remote processor only once. For each subbox pair, an atom pair list is periodically evaluated and locally stored for the calculation of the atomic forces. All communication in the processing of the subbox pair lists is performed with one-sided asynchronous GA communication calls.
Domain decomposition molecular dynamics is efficient because of the locality of the atomic interactions when cutoff radii are used. However, especially for electrostatic interactions, the use of cutoff radii can be a serious approximation. To be able to account for electrostatic interactions beyond the cutoff radius in an approximate way, the particle-mesh Ewald (PME) method has been implemented.20 In this method, short-range interactions are evaluated as usual, but with a scaling function that depends on the cutoff radius. Long-range contributions are approximated in the form of a discrete convolution on an interpolated grid of charges. This method requires the Fourier transform of the charge grid to perform the convolution efficiently. The pFFT implemented in NWChem distributes the grid in slabs for efficiency. The implementation in NWChem allows the use of a subset of the available processors to be used for the transforms while the other processors are evaluating local forces. Since all processors need the potential grid to calculate the PME forces, this part of the calculation is separated from the evaluation of the charge grid and performed after all local forces have been evaluated. In practical simulations, this separation avoids synchronization of all processors before the PME forces evaluation.
Load balancing
After the calculation of the forces, a synchronization of all processors is required to ensure that all contributions from remote processors have been properly accumulated. Particularly in simulations of heterogeneous systems, the time to calculate the forces will be different on each processor, resulting in substantial synchronization time at this point in the calculation. To make molecular simulations of such systems efficient, dynamic load-balancing techniques are needed. Because the force evaluation in NWChem involves only asynchronous communication, the distribution of the idle time on each processor at this synchronization can be used to determine how the work can be more evenly distributed.
In NWChem, two methods for load balancing have been implemented. The first method is based on a redistribution of the subbox pair lists.21 To avoid unnecessary communication or replication of data, in NWChem each subbox pair is handled by one of the processors that owns one of the subboxes. In the current implementation, each processor, in order of increasing synchronization time, transfers one interprocessor box pair to its least busy neighbor processor for which it handles the subbox pair. Further, each processor can only receive one additional subbox pair. This cascading subbox pair distribution avoids the back and forth exchange of a subbox pair between the two busiest processors if only one subbox pair is exchanged, as was observed for some simulations. The subbox pair redistribution requires only minimal additional communication. Furthermore, this method allows the interatomic calculations for each subbox pair to be evaluated in the same order, leaving the generated molecular trajectory unaffected by the load balancing.
The second load-balancing technique is based on a resizing of the processor domains.22 The size of the subboxes on the busiest processor is reduced to reduce the workload. This reduction increases the size of the subboxes, and the corresponding load, on the other processors. A limit on the minimum size of the subboxes ensures that the subbox pair lists cover all interactions within the specified cutoff radius without increasing the total number of subbox pairs. This method of load balancing has the disadvantage of requiring the communication of atomic data because of the resulting redistribution of atoms, as well as divergence of the molecular trajectory caused by differences in numerical accuracy from the changed order in which contributions to the atomic forces are evaluated as a result of the subbox resizing.
In NWChem, both load-balancing techniques can be used in a simulation. The subbox pair list redistribution does not change the order in which interactions are evaluated, and thus does not suffer from numerical differences that may lead to diverging trajectories. Moreover, the method exchanges subbox pair entries only, and thus requires a minimum of communication. Subbox pair resizing, however, requires a redistribution of atoms and thus significantly more communication. Also, since the number and order of atoms in each subbox is changing, the order of the calculations of interactions is changing. Consequently, numerical differences lead to diverging trajectories. Therefore, in the combined load balancing, subbox resizing is applied only after a specified number of load-balancing steps by subbox pair redistribution do not reduce the accumulated synchronization time.
Load-balancing efficiency of NWChem
The basic components in a molecular dynamics step as implemented in NWChem are depicted in Figure 3 from the top down and include the following: the load-balancing step, atom redistribution, global synchronization to ensure coordinates are current and available on all processors, PME charge grid evaluation, processor subset synchronization, reciprocal space calculations that include two three-dimensional Fast Fourier Transforms, recording of the trajectory, the calculation of interatomic forces in a loop consisting of retrieval of coordinates, calculation of pair lists and forces and accumulation of these forces, a global synchronization to ensure forces have all been evaluated, and the calculation and recording of properties. All steps between the two global synchronizations include only asynchronous communication, which allows time stamps around the second global synchronization to be used to determine the idle time on each of the processors due to load imbalance. This synchronization time is used in the dynamic load balancing based on the implementations described in the previous section.
Figure 3
In Figure 3, boxes off-center to the left are functions performed by the domain decomposition model SPACE, the center boxes are functions performed by the driver, and off-center boxes to the right are force field functions performed by the CAFE module. The two horizontal red lines represent global synchronizations to ensure all coordinates are current before the forces are calculated, and to ensure that all forces are available before advancing the coordinates. Boxes in red contain global operations, the one in green contains processor subset communications, and those in blue contain asynchronous communications. The green horizontal line represents a processor subset synchronization to ensure that the PME charge grid calculation was completed. The red dashed line is a wait state on each processor until the PME potential grid is available. Note that recording of the trajectory is included in the load balancing and involves asynchronous communication initiated by the processor that performs I/O operations.
The efficiency of the load balancing can be expressed in terms of the time t from immediately before the first global synchronization to immediately after the second global synchronization, which will be the same on all n processors, and the accumulated synchronization time Tsynch = ni=1 tisynch of the second global synchronization.
=
|
nt Tsynch
|
(3)
|
|
|
nt
|
To determine the behavior and efficiency of the different load-balancing options in NWChem, molecular dynamics simulations of the enzyme haloalkanedehalogenase in aqueous solution (see Figure 4, which shows the secondary structure elements, within space-filling representation, the Na+ counter ions and the active site residues Asp124, Asp260, and His289) were carried out on 27 processors of the IBM SP* with single P2SC processor nodes, the IBM SP with four 604 processor silver nodes, and the Cray T3E**-900. The molecular system consists of the 4858-atom enzyme, 17 counterions, and 12128 water molecules in a periodic simulation volume. The simulations were carried out using the particle-mesh Ewald technique to account for electrostatic interactions beyond the cutoff radius, at a constant temperature of 298 K (Kelvin) and constant pressure of 1 atmosphere. Included in the load balancing is the recording of the coordinates of the system to a trajectory file, every ten molecular dynamics steps.
Figure 4
In Figures 5, 6, and 7, the measured accumulated synchronization times and total wall-clock times for a single molecular dynamics step in seconds versus simulation time in picoseconds (ps) are given. The times are for a 10000-step molecular dynamics simulation of solvated haloalkanedehologenase performed on 27 processors. They are shown for simulations without load balancing (blue curves), load balancing based on subbox redistribution only (green curves), subbox pair resizing only (black curves), and the combination of these (red curves). Figure 5 illustrates the times for the Molecular Science Computing Facility (MSCF) IBM SP with P2SC processors at the Pacific Northwest National Laboratory software computational chemistry applications classical molecular dynamics quantum mechanical calculations architecture memory domain decomposition intermolecular interactions periodic atomic reassignmentsoratory, and Figure 6 illustrates them for the MSCF experimental IBM SP with silver nodes. Figure 7 illustrates the times for the National Energy Research Scientific Computing Center (NERSC) Cray T3E-900 at the Lawrence Berkeley National Laboratory. In the simulations using the combined load balancing, subbox resizing was applied only after 10 successive subbox pair redistributions did not reduce the accumulated synchronization time.
Figure 5
Figure 6
Figure 7
On all three machines, the two load-balancing methods reduce the accumulated synchronization time to less than half the time measured for the simulations without load balancing. The combination of the two methods reduces the accumulated synchronization time by an order of magnitude, resulting in the reduction of the wall-clock time per molecular dynamics step by a factor of two.
The fluctuations in the measured accumulated synchronization time presented a particular challenge in the design of the load-balancing algorithms, especially for the IBM SPs. Such fluctuations are expected for computer systems with higher latency and lower bandwidth, increasing the chances for contention. Another possible cause is the need for resources by deamons run by the operating system. This is especially problematic for molecular dynamics simulations, since the wall-clock time between synchronizations is usually very short. In the original implementation of the load-balancing methods, the fluctuation in the accumulated synchronization time was too large for the dynamic load balancing to be effective on the IBM SP. In the current implementation, the dynamic load-balancing algorithms use the minimum or the average accumulated synchronization time over a number of time steps. Even with the averaging used in this study, the difference in these fluctuations between the IBM SP and Cray T3E are significant, as can be seen from Figure 8. It shows the accumulated synchronization time and wall-clock time for a single molecular dynamics step in seconds versus simulation time in ps for a 10000-step molecular dynamics simulation of solvated haloalkanedehalogenase performed on 27 processors of the Cray T3E-900 (red) and IBM SP (blue), with combined subbox pair redistribution and subbox resizing.
Figure 8
In Table 2 the accumulated time for each component of the force calculation with combined load balancing is given, with the total wall-clock time per molecular dynamics step and the efficiency of the dynamic load balancing, which is around 0.9 or better. The components without communication, such as the calculation of local or remote forces, are slightly faster on the Cray T3E. All communication is performed using the Global Array toolkit, and the observed differences in timings for data retrieval and remote data accumulation on the IBM SP and Cray T3E are consistent with reported Global Array performance on those machines.23 Communication on the IBM SP 604 is less efficient than on the IBM SP P2SC. Since the parallel Fast Fourier Transform includes communication, its timing on the 604 processors is longer than on the P2SC.
|
|
Table 2 Accumulated wall-clock time in seconds for different components
|
| |
| |
| |
| |
| |
|
Cray T3E
|
IBM SP P2SC
|
IBM SP 604
|
|
|
Replicated data interactions
|
0.000
|
0.001
|
0.002
|
|
PME charge grid calculation
|
1.969
|
4.561
|
3.486
|
|
PME subset synchronization 1
|
0.565
|
2.141
|
1.138
|
|
Reverse FFT
|
0.629
|
1.089
|
2.260
|
|
PME reciprocal work
|
0.112
|
0.294
|
0.215
|
|
PME subset synchronization 2
|
0.022
|
0.110
|
0.172
|
|
Forward FFT
|
0.608
|
1.167
|
2.235
|
|
I/O
|
0.034
|
0.058
|
0.025
|
|
Local coordinate retrieval
|
0.011
|
0.020
|
0.022
|
|
Local force calculation
|
13.418
|
17.718
|
14.122
|
|
Local force accumulation
|
0.009
|
0.018
|
0.019
|
|
Remote coordinate retrieval
|
0.250
|
0.956
|
1.384
|
|
Remote force calculation
|
10.467
|
14.560
|
10.858
|
|
Remote force accumulation
|
1.215
|
1.028
|
2.283
|
|
PME wait
|
0.002
|
0.002
|
0.002
|
|
PME forces calculation
|
1.111
|
2.749
|
4.638
|
|
Synchronization
|
2.359
|
6.814
|
4.653
|
|
MD step total
|
1.301
|
2.099
|
1.934
|
|
Load balancing efficiency
|
0.942
|
0.893
|
0.920 |
|
Conclusion
The efficiency of molecular dynamics simulations on massively parallel computers depends critically on the availability of efficient dynamic load-balancing techniques. The molecular dynamics module of NWChem is based on a domain decomposition approach in which all remote data access during the evaluation of atomic forces is performed by one-sided asynchronous communication operations. This approach allows the dynamic load balancing to be determined by the measured synchronization time at the end of the force evaluation. A combination of redistribution of calculation of forces from interprocessor subbox pair interactions and resizing of processor domains is shown to lead to much improved efficiency of molecular dynamics simulations.
Acknowledgment
This work was supported under the auspices of the U.S. Department of Energy (DoE) Office of Biological and Environmental Research (OBER), which funds the Environmental Molecular Sciences Laboratory (EMSL). Pacific Northwest National Laboratory software computational chemistry applications classical molecular dynamics quantum mechanical calculations architecture memory domain decomposition intermolecular interactions periodic atomic reassignmentsoratory is a multidisciplinary laboratory operated by Battelle for the U.S. Department of Energy. Computer time on the IBM SPs was provided by the Molecular Sciences Computing Facility at EMSL and on the Cray T3E by the National Energy Research Scientific Computing Center (NERSC). J. A. McCammon's work is supported in part by a grant from the National Science Foundation (NSF).
*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Cray Inc. or Linus Torvalds.
Accepted for publication December 29, 2000.
|