IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Journal of Research and Development

Blue Gene   Volume 49, Number 2/3, 2005
Table of contents: HTMLPDF This article: HTML PDFDOI: 10.1147/rd.492.0447Copyright info

Early performance data on the Blue Matter molecular simulation framework

by R. S. Germain,
Y. Zhestkov,
M. Eleftheriou,
A. Rayshubskiy,
F. Suits,
T. J. C. Ward,
and B. G. Fitch

Blue Matter is the application framework being developed in conjunction with the scientific portion of the IBM Blue Gene® project. We describe the parallel decomposition currently being used to target the Blue Gene/L machine and discuss the application-based trace tools used to analyze the performance of the application. We also present the results of early performance studies, including a comparison of the performance of the Ewald and the particle-particle particle-mesh (P3ME) methods, compare the measured performance of some key collective operations with the limitations imposed by the hardware, and discuss some future directions for research.

Introduction

One of the two major goals of the Blue Gene* project announced by IBM in December 1999 was the application of the massive computational power being developed for that project to advance our understanding of biologically important phenomena, such as protein folding via large-scale simulation [1]. Studying the mechanisms behind protein folding and related areas can require long biomolecular simulations on a wide range of system sizes. The application supporting this work must map well onto a range of parallel cluster sizes to optimize scientific throughput for a particular study. System sizes of interest range from 5,000 atoms (for a small peptide in water) through 30,000 to 50,000 atoms (for a folded 200-residue protein in water or a small protein membrane system) up to 200,000 atoms or more (for larger-scale membrane systems or an unfolded protein in water). Our design point targets system sizes in the 10,000- to 100,000-atom range.

The application effort we have made as part of the Blue Gene project also has two goals: first, to support the scientific goal of the Blue Gene project related to the use of biomolecular simulation for studies of biologically important phenomena, and second, to explore novel approaches to programming applications that target massively parallel system architectures in the context of a concrete problem. To address these goals, we have developed an application framework, called Blue Matter, that is currently focused on biomolecular simulation, and whose architecture has previously been described [2]. A version of Blue Matter running on conventional hardware (IBM pSeries*) has been used in production for scientific work [3], and we are now doing production runs on Blue Gene/L (BG/L) prototype hardware.

One of the principal design goals for this framework is to encapsulate the semantics of biomolecular simulation through the definition of appropriate interfaces so that explorations of performance and scalability alternatives can take place with minimal intervention from the domain experts on the team. Furthermore, we have decomposed the application into a core parallel engine that runs on BG/L and a set of support modules providing setup, monitoring, and analysis functionality that can run on other host machines. Minimizing system environmental requirements for the core parallel engine enables the use of nonpreemptive low-overhead parallel operating system kernels [4] to enable scalability to thousands of nodes.

In this paper, we describe some early performance data obtained on BG/L using the Blue Matter application. Our methodology for acquiring and analyzing performance data is described, along with the results of a comparison between two algorithmic alternatives for evaluating long-range electrostatic forces. We present a comparison of the measured performance of a key component of our application (in the context of our current parallel decomposition) with the potential performance of the hardware. Finally, we give a brief description of our plans for further explorations.

BG/L from an application perspective

While detailed descriptions of the BG/L hardware platform exist elsewhere [5], it is useful to review some of the aspects of the machine that are most interesting to application developers. First, there are two identical IBM PowerPC* 440 central processing units (CPUs) on each chip, and each CPU has a dual floating-point unit (FPU). This FPU executes instructions in single-instruction multiple-data (SIMD) fashion, like a two-element vector processor. Making more effective use of these double FPUs is still an area of intense activity within the Blue Gene project. Although the CPUs are identical, one of the CPUs is intended for use as a communication coprocessor when executing a communication-intensive application. There are two main modes of operation supported by the system software:

  • Coprocessor mode, in which one CPU is used for computation and the other is used for communication.
  • Virtual node mode, in which both CPUs carry out both communication and computation and, from the perspective of the application, support independent processes that communicate only via message passing.

The results reported here were all taken in coprocessor mode and with the compiler flag -qarch=440, which does not take advantage of the double FPU. As our explorations of parallel decompositions and the work on code generation by the compiler continue, we expect to remove both of these restrictions.

BG/L has two physically separate networks used for high-performance communication between nodes: a three-dimensional (3D) torus network in which every node is connected to its six nearest neighbors, and a hardware collective network that allows rapid broadcasts and reductions. The collective hardware supports fixed-point operations within reductions. For the results reported here, a fixed-point MPI_AllReduce implemented on the hardware collective network is used to make atom positions stored on individual nodes visible to all of the nodes in the system (an operation equivalent to MPI_Allgatherv). The accumulation of forces on each particle in the system is carried out using a floating-point MPI_AllReduce operation implemented on the torus network. The torus network is also used as the communication fabric for the 3D fast Fourier transform (FFT) [6] that forms an integral part of the particle-particle particle-mesh (P3ME) technique used to handle electrostatic forces.

Application-level tracing on massively parallel machines

To tune application performance, including identifying load balance issues on a machine such as Blue Gene/L, tools capable of acquiring, managing, and analyzing data from thousands of nodes are a necessity. To address this need for our own code development, we developed an application-based tracing methodology that is lightweight, can be turned off entirely via recompilation, and for which we have developed tools that can handle hundreds or thousands of nodes. The current implementation has C++ bindings only, but the techniques involved could be used with other languages as well.

Currently, two different sorts of analysis are performed on the trace data created in this way. First, the acquired trace data can be viewed directly in a visualization tool, as shown in Figure 1. Second, statistical analysis of the trace data from a series of runs at different node counts can be carried out in an automated way and output in tabular form. Because the trace visualization tool shows the actual timestamps, skew between nodes as well as load imbalance can be identified.

Figure 1 Figure 1

Parallel decomposition and early performance data

The decomposition currently implemented by the Blue Matter application is a replicated data decomposition in which a pair of arrays containing all of the positions of particles and the net force on each is maintained. A number of papers have described the taxonomy of parallel decompositions of molecular dynamics along with their scaling properties with respect to space per node, computational burden, and communication volume [79]. In particular, when large systems with several hundred thousand particles or more are the target, a spatial decomposition scheme is typically used in which the problem domain is divided into subvolumes assigned to specific nodes. A replicated data scheme was chosen as the first decomposition for exploration primarily because it simplifies load balancing. It has the disadvantage that the amount of memory required on each node and the volume of global communication required are both proportional to the number of particles being simulated and independent of the processor count. For the range of system sizes (10 to 100,000 atoms) and node counts (512 or smaller) that would be available during the initial phases, calculations based on projected hardware capabilities indicated that performance would be adequate [2].

The system of particles is partitioned into fragments that comprise small groups of particles that are connected by bonds. We impose the rule that a bond whose length is constrained cannot cross fragment boundaries. The current decomposition binds individual fragments to a particular node. To support the calculations required for molecular dynamics, the positions of each particle are globalized via an AllReduce operation by first doing a fixed-point reduction in a vector of 3N double-precision numbers, where N is the number of particles, and then a broadcast. (Both are currently implemented using the hardware collective network.) To avoid replication of pairwise force calculations, the summed forces on each particle are also computed via a global (floating-point) reduction, which is currently implemented on the torus network. Both of these operations are invoked via MPI collective operations on MPI_COMM_WORLD. The global force reduction also simplifies load balancing, because contributions to the summed force vector can come from any node.

Although these operations are nonscalable (in the sense that they should be approximately constant time, independent of node count), as long as they represent a small fraction of the total timestep, this approach should be viable. It is instructive to attempt to estimate the ultimate hardware limits to performance for these operations. For example, the position globalization uses the hardware collective and fixed-point operations that are built into the collective hardware. Since the collective network bandwidth is 4 bits per cycle, and assuming that the collective network can stream at full bandwidth in both directions, the time required is approximately 69 ns per particle. The comparisons between the limits established by this estimate and the performance measured at the application layer are given in Table 1.


Table 1  Comparison of the hardware bandwidth-limited time and measured times for the integer AllReduce on the hardware collective used in the position globalization operation on two different system sizes. The data reported here is taken for 512 nodes. The bandwidth-limited times were computed by simply taking the data volumes for the positions 3 × Natoms × sizeof(double) and dividing by the collective bandwidth, 4 bits per cycle assuming a 700-MHz processor clock. A very large fraction of the hardware-limited bandwidth (more than 90%) is realized at the application.
Atom countHardware bandwidth-limited time
(ns)
Integer AllReduce (measured)
(ns)

5,289363,000396,000
43,2222.96 × 1063.15 × 106

Algorithmic explorations

Ewald implementation

In the direct implementation of the Ewald method [10], the computationally expensive structure factors are evaluated explicitly for all charges and reciprocal space momenta. A number of algorithmic improvements have been introduced to reduce the number of trigonometric function calls and the total number of floating-point operations.

The parallel Ewald algorithm is structured in the following way. The positions and charges of all particles are available to each node after the position globalization operation. Each node is assigned a subset of particles for which it calculates the individual structure factors for all reciprocal space momenta. The global structure factors are summed over all particles for each momentum using the global reduction operation. The forces on particles assigned to a processor are computed using both global and individual structure factors.

The only communication specific to this parallel Ewald algorithm is the global reduction of structure factors. The forces on particles from the reciprocal space contribution of the Ewald method are computed on the nodes to which these particles are assigned and do not require any additional communication. The cost of this communication step should be constant as a function of node count since the volume of floating-point data is fixed.

Parallelization of the P3ME method

In this section we describe the parallel implementation of the P3ME algorithm [11] for an efficient evaluation of the long-range forces. The P3ME method on BG/L is implemented via a volume decomposition of the simulation volume (and the corresponding FFT grid) onto the processor node mesh. We briefly describe the parallel computation of the P3ME method:

  • The point charges are mapped to a grid with weights determined by the Euler exponential spline coefficients, the cardinal B-splines. It is assumed that the number of grid points is sufficiently large and the interpolation level is sufficiently small that point-charge positions required for the interpolation have been communicated to each processor as part of the real space intermolecular force evaluation.
  • A forward 3D FFT is carried out on the meshed charge data using the algorithm described in a companion paper [6].
  • Pointwise multiplication is performed by the kernel: [4piexp(−g2/4alpha2ewd)/Vg2].
  • Inverse 3D FFT is performed on the resultant product.
  • Forces are calculated in real space using the transformed product function and the derivatives of the weighting functions.
  • All nodes participate in an all-reduction (floating-point summation) implemented over the torus network, with each node contributing the results of its assigned interactions (from real space and k-space).
  • In the current decomposition, each atom is bound to a particular node, and the positions and velocities of each atom are propagated forward in time by numerical integration of the equations of motion.
Parallel performance measurements

Parameters used in simulations

To carry out a fair comparison of the Ewald and P3ME method performance, it is necessary to select parameters that give comparable numerical accuracy for both methods. There exist analytical estimates of the accuracy for both methods [1213]. We used the numerical accuracy of the energy calculation as compared with the result at large real-space cutoff and large Kmax as a rough measure.

The parameters that affect the accuracy of the reciprocal space of the P3ME method are the mesh spacing size, which is usually selected to be between 0.5 Å and 1.0 Å, the alpha parameter, the number of mesh points for charge assignment, and the algorithm to compute electric fields at the mesh points. For the latter, the choices in Blue Matter are the “analytical” and “gradient” methods, which are named following [11]. They differ in the number of inverse FFTs involved in force and energy calculations, one for analytical and four for gradient. In the measurements reported here, we use the analytical method. For the Ewald method, the relevant parameters are alpha and the cutoff momentum, which is proportional to the parameter called Kmax and inversely proportional to the box size. The real space contribution accuracy is the same for both methods, as determined by the value of the cutoff in the real space and the parameter alpha.

For the purposes of the current performance runs, the real-space cutoff was fixed at 10 Å, with the switch function in the potential gradually decreasing to zero within the 1-Å-thick shell. The function is such that the resulting forces are smooth around the switch region. The integrator is velocity Verlet [14], with the timestep size of 1 fs for the 5K-atom hairpin system and 2 fs for the 43K-atom membrane system. These timestep choices give good energy conservation for their respective systems. The charges were assigned to a cube surrounding each particle with four mesh points extant in three dimensions. In the hairpin simulation, we used a mesh size of 64 × 64 × 64; for the larger rhodopsin system, the mesh size was chosen to be 128 × 128 × 128. We set the alpha parameter to 0.28. The Kmax parameter was set to 11 for the hairpin system and 20 for the rhodopsin system. This choice of parameters gave a reasonable accuracy (about six significant digits) for both the Ewald and P3ME methods.

Two different molecular systems were used for these studies. The smaller system was a solvated β-hairpin system comprising 5,239 atoms, including 1,660 single-point-charge (SPC) water molecules, and used the Optimized Potential for Liquid Simulation—All Atom (OPLSAA) [15] force field. The larger system, a solvated lipid/protein system, comprised 43,222 atoms, including 7,400 SPC water molecules and used the CHARMM [16] force field.

Results

We carried out this study to understand the scalability characteristics of our current implementation of the molecular dynamics application on the BG/L architecture using two different methods for evaluating the long-range forces P3ME and Ewald. Message Passing Interface (MPI) was used as the communication layer; the fixed-point collective operation that was used to globalize the particle positions was implemented on the hardware collective network; and the floating-point collective operations used to reduce the force contributions and compute the structure factors for the Ewald technique were implemented on the torus network.

We report performance numbers on 32, 128, 512, and 1,024 nodes [Figures 2(a)–2(d)]. All of this data was taken in coprocessor mode, with only one CPU per node used for computation and without any use of the double FPU. Load balancing and autotuning of guard zones for Verlet lists take place during the first few hundred timesteps of the simulation, so our data is obtained by running the simulation for 1,300 timesteps and using the last 100 steps to compute the averages for the CPU times. The component contributions to the total execution time can be measured using the trace tool described in the section above on application-level tracing.

Figure 2 Figure 2

For the Ewald method, a plot of the major contributions is shown in Figure 2(a). The k-space portion of the nonbonded interactions, the portion of the calculation that is different in the P3ME method and in the Ewald method, is further broken down in Figure 2(b). The AllReduce is the communication step that performs the global reduction of data over all of the processors to compute the global structure factors S(k). The cost of this communication is essentially independent of the number of processors and is the only nonscalable part that is specific to this Ewald implementation (other nonscalable components are common to both the current Ewald and P3ME implementations). This is the contribution that limits the performance of the reciprocal space Ewald computations at high node counts.

Figure 2(c) shows the parallel performance of the Blue Matter application using the P3ME technique for calculating the long-range electrostatic forces. The data shown in the figure was obtained for a molecular system comprising a solvated β-hairpin (peptide chain) containing 5,289 atoms. In the breakdown of the total time per timestep, the major contributors are, once again, the real space and k-space nonbond calculations; at the higher node counts, the contribution of the floating-point collective (ReduceForces) becomes significant. We should note that the initial implementation of the collective used for the GlobalizePositions operation was on the torus and had about the same performance as the ReduceForces operation. The utilization of the hardware collective network to implement the GlobalizePositions operation resulted in significant performance improvement at high node counts, and we expect that eventual optimization of the floating-point collective operation used for the ReduceForces operation will yield additional improvements.

Figure 2(d), a more detailed view of the k-space contributions, shows that the forward and inverse FFTs are the dominant contributors, and both scale well up through 512 nodes. At 1,024 nodes, the forward FFT displays a slowdown, while the inverse FFT continues to scale. We believe that this is an artifact caused by differences in data alignment and/or an issue with the current MPI driver rather than a fundamental limitation of this P3ME implementation. The FFT itself would ideally be limited only by the bisectional bandwidth of the machine, which scales as p2/3, but in reality, as the node count becomes very large, the message sizes in the FFT decrease to the point at which hardware and software latencies become significant [6].

We also need to discuss the contribution labeled P3ME Assign Charge in Figure 2(d). While its contribution is relatively small, its scaling is significantly worse than other parts of the P3ME reciprocal space calculation. The reason is fundamental to the P3ME algorithm itself, because each charge is distributed over several surrounding nodes on the FFT mesh. When this mesh is distributed over more processors, each particle begins to affect more processors, and more communication and/or replicated computation is required. While the scaling is an issue, we believe that further optimization of the P3ME Assign Charge operation is possible.

Finally, Figure 3 compares the performance of the Ewald and P3ME methods on the two systems described above. In the current implementations, the results on the smaller system are comparable, while on the larger system, the P3ME technique still has a significant advantage, although that advantage appears to decrease as the node count increases. At the higher node counts, the P3ME Assign Charge operation is taking a significant fraction of the k-space nonbond time and will be a target for further optimization.

Figure 3 Figure 3

It should also be noted that the compilation flags used for the two sizes of systems differed. We were forced to use -O21 for the compilations on the larger system in this scaling study, while we were able to use -O3 for the smaller system. The absolute performance on the 43,222-atom system when -O3 can be used is about 100 milliseconds per timestep (107 ns) at 512 nodes, about twice as fast as the performance measured with -O2 shown in Figure 3.

As we consider alternative parallel decomposition schemes, it is possible to eliminate the dependency that both methods share on the nonscalable communications operations, position globalization, and global force reduction. However, the global floating-point reduction required by the Ewald technique to compute the structure factors S(q) does not appear to be easily dispensed with, and this may handicap the Ewald technique in future comparisons. One should note that the replication of position and force data on each node also limits the size of the molecular system that can be handled, since the memory footprint required for the replicated data is 6N sizeof(double), where N is the number of particles.

Summary and future directions

We have presented a snapshot of early performance results on a molecular dynamics application, Blue Matter, that has been developed as an adjunct to the Blue Gene science mission [1]. Using lightweight trace points, we are able to measure the scalability of individual contributions to the execution time of the code, which has allowed us to identify key MPI collective operations whose further optimization will improve both absolute performance and scalability.

As a first step in exploring algorithmic alternatives for the efficient evaluation of electrostatic interactions on massively parallel computers, we have compared the performance of our implementations of Ewald and the P3ME techniques on two system sizes. At large node counts, the performance of the Ewald technique is comparable to that of P3ME for the smaller system size, while, as expected, P3ME shows a significant advantage for larger system sizes. Both of these approaches will benefit from additional improvements in the implementation of the MPI floating-point reduction collective. Our planned explorations include alternative parallel decompositions to reduce or eliminate the nonscalable collective operations currently in use. Further explorations of algorithmic alternatives for the long-range interactions may include the fast multipole method [17] as applied to periodic systems [18]. The results reported here were taken in a mode in which only one of the two PowerPC 440 CPUs on each node were used for computation and without any attempt to have the compiler generate code for the double floating-point unit. Significant improvements in the absolute performance of the compute-intensive contributions to the total timestep duration should be obtainable when both CPUs can be used and as the compiler code generation for the double floating-point unit improves.

Acknowledgments

We acknowledge the contributions of others who have participated in the development of the Blue Matter code, including Michael Pitman, Jed Pitera, Yuk Sham, William Swope, and Ruhong Zhou. We also acknowledge the contributions of the Blue Gene/L hardware and system software teams whose efforts and assistance made it possible for us to use the Blue Gene/L prototype hardware.

*Trademark or registered trademark of International Business Machines Corporation.

References


Footnote

1-02, -03 are compiler options that control how aggressively the code is optimized (sometimes the optimizations can change the order of operations or possibly the semantics of the code).

Received July 29, 2004; accepted for publication October 21, 2004; Published online April 12, 2005.


    About IBMPrivacyContact