IBM Skip to main content
  Home     Products & services     Support & downloads     My account  
  Select a country  
Journals Home  
  Systems Journal  
  ·  Current Issue  
  ·  Recent Issues  
  ·  Papers in Progress  
  ·  Search/Index  
  ·  Orders  
  ·  Description  
  ·  Author's Guide  
Journal of Research
and Development
  Staff  
  Contact Us  
Systems Journal  
Volume 34, Number 2, 1995
Scalable Parallel Computing
 Table of contents: arrowHTML      DOI: 10.1147/sj.342.0152 arrowCopyright info
   

SP2 system architecture

by T. Agerwala, J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias, and M. Snir
Scalable parallel systems are increasingly being used today to address existing and emerging application areas that require performance levels significantly beyond what symmetric multiprocessors are capable of providing. These areas include traditional technical computing applications, commercial computing applications such as decision support and transaction processing, and emerging areas such as "grand challenge" applications, digital libraries, and video production and distribution. The IBM SP2™ is a general-purpose scalable parallel system designed to address a wide range of these applications. This paper gives an overview of the architecture and structure of SP2, discusses the rationale for the significant system design decisions that were made, indicates the extent to which key objectives were met, and identifies future system challenges and advanced technology development areas.
The IBM SP2* is a general-purpose scalable parallel system based on a distributed memory message-passing architecture. Generally available SP2 systems range from 2 to 128 nodes (or processing elements), although much larger systems of up to 512 nodes have been delivered and are successfully being used today. The latest POWER2* technology RISC System/6000* processors are used for SP2 nodes, interconnected by a high-performance, multistage, packet-switched network for interprocessor communication. Each node contains its own copy of the standard AIX* operating system and other standard RISC System/6000 system software. A set of new software products designed specifically for the SP2 allows the parallel capabilities of the SP2 to be effectively exploited.

Today, SP2 systems are used productively in a wide range of application areas and environments in the high-end UNIX** technical and commercial computing market. This broad-based success is attributable to the highly flexible and general-purpose nature of the system. This paper gives an overview of the architecture and structure of SP2, discusses the rationale for the significant system design decisions that were made, indicates the extent to which key objectives were met, and identifies system challenges and advanced technology development areas for the future.

We first discuss the overall goal of the SP2 system and the key focus areas. Next we discuss our rationale for the systems approach we have selected and which we will refine over time to meet these requirements. This is followed by a discussion of the overall SP2 system architecture, some of the major system components, and the SP2 performance. We conclude with our views on the key challenges facing system architects of scalable parallel systems and areas in which we need to focus in the future, and a summary of how the SP2 systems are being used today.

Design goals

Massively parallel processors (MPPs) have been around for a number of years. These systems have typically been designed to apply the combined capacity of hundreds and even thousands of low-cost, low-performance processing elements for solving single large problems. However, until recently these systems were not adopted for mainstream supercomputing applications. Since the individual processors gave very low performance, considerable effort was required up front to parallelize an application code sufficiently (that is, divide the code into multiple parts that can execute in parallel) even to get performance equivalent to the mainstream uniprocessors. That was a major inhibitor. In addition, limited processor memory, limited input/output, poor reliability, primitive nonstandard software development environments and tools, and programming models that were closely tied to the underlying hardware (such as the interconnection structure) all contributed to their failure to be generally accepted. MPPs remained, at best, special-purpose machines for very narrow niche applications.

From the inception of the SP2 project, our goal was to design general-purpose scalable parallel systems. We realized (as did others [1]) that for massively parallel systems to succeed they must be more general-purpose and less intimidating to use than they have been in the past. They must also provide all the capabilities available on a traditional system, at similar or lower price/performance. The basic nodes must be powerful enough and the underlying operating system must have full function so that users can move their current work over to the system with little effort and run their current applications in serial mode with acceptable performance. The systems must support familiar interfaces, tools, and environments, support existing standards and languages, and have common applications available. In this way, users can begin productive use of the system with little upfront effort and gradually parallelize and optimize the code over time. In addition, the system must also provide support in key areas to enable customers to grow (or scale) their applications in size and performance beyond what can be achieved on conventional systems.

Consequently, we have designed our systems to be used in a variety of environments. These include very large, stand-alone configurations dedicated to solving extremely complex and large single applications, smaller systems that coexist with mainframes and traditional supercomputers and that are used to offload some of the work for price/performance reasons, and consolidated servers for midrange local area network server environments.

The scalable parallel capabilities of the SP2 system allow customers to scale their applications, both in computation and data, much beyond what is possible with conventional systems. Our initial focus with the earlier SP1* was on high-performance scientific and technical computing in areas such as computational chemistry, petroleum exploration and production, engineering analysis, research, and "grand challenge" problems (those important for national interest). Today, SP2 systems address those areas and are also being used increasingly for commercial computing--primarily for complex query, decision support, business management applications, and on-line transaction processing. Over time we expect SP2 systems to be used for emerging applications such as large information servers, digital libraries, personal communications, video-on-demand, and interactive television, as well as for mission-critical applications for business operations such as airline reservations and point-of-sales.

In order to properly address these diverse applications and environments, we realized that we needed to focus the design on three key areas: programming models, flexible architecture, and system availability.

  • Programming models--The SP2 must support key programming models prevalent in the technical and commercial computing environment so that existing applications can be readily ported (or migrated from another processor) to the SP2. These models are discussed in more detail later in this section.
  • Flexible architecture--The SP2 must be flexible in how it can be configured and how it can be used. The system must be scalable from a very low entry point to a very large system, and be able to do this in small increments. The nodes must be individually configurable for hardware and software to meet the specific requirements of the customer's application and environment. The system must support a multiuser environment with a mix of serial and parallel, and batch and interactive jobs, and must accommodate a mix of throughput and job turnaround time requirements.
  • System availability--In order to succeed commercially, the SP2 cannot merely be a research machine; it must exhibit good reliability and availability characteristics so that customers can run their production codes on it. Points of catastrophic failures must be removed and failures must be isolated to the failing component and not be allowed to propagate. The system must support concurrent and deferred maintenance for the most common service situations and must support concurrent upgrade for the most common system upgrade situations. Finally, critical hardware and software resources must be designed for transparent recovery from failures.

Programming models in technical computing. The availability of software applications from vendors is critical to the success of any technical computing system. There is a large number and variety of such applications, and it is important to make it as easy as possible for software vendors and customers to port their applications to the SP2.

There is a significant number of applications available today for the RISC System/6000, and we must preserve the execution environment for these applications so that they can continue to run serially on an SP2 node without requiring any modifications.

In addition, key technical applications must be able to execute in parallel. To facilitate this, the system must provide support for prevalent parallel programming models and styles, and provide a comprehensive set of tools and environments (for both FORTRAN and C) for the development of new parallel applications, the porting of existing parallel applications, and the conversion of existing serial applications.

There are essentially three parallel programming models that are being used in large scalable systems (see Figure 1), the message-passing programming model, the shared-memory programming model, and the data parallel programming model.

Message-passing programming model. With the explicit message-passing model, processes in a parallel application have their own private address spaces and share data via explicit messages; the source process explicitly sends a message and the target process explicitly receives a message. However, since data are shared by explicit action on the part of the processes involved, synchronization is implicit in the act of sending and receiving of messages. The programs are generally written with a single-program, multiple-data (SPMD) stream structure, where the same basic code executes against partitioned data. Such programs execute in a loosely synchronous style with computation phases alternating with communication phases. During the computation phase, each process computes on its own portion of the data; during the communication phase, the processes exchange data using a message-passing library.

Shared-memory programming model. With the shared-memory model, processes in a parallel application share a common address space, and data are shared by a process directly referencing that address space. No explicit action is required for data to be shared. However, process synchronization is explicit; since there are no restrictions on referencing shared data, a programmer must identify when and what data are being shared and must properly synchronize the processes using special synchronization constructs. This ensures the proper ordering of accesses to shared variables by the different processes. The shared-memory programming model is often associated with dynamic control parallelism, where logically independent threads of execution are spawned at the level of functional tasks or at the level of loop iterations.

Data parallel programming model. The data parallel model is supported by a data parallel language such as High Performance FORTRAN. [2] Programs are written using sequential FORTRAN to specify the computations on the data (using either iterative constructs or the vector operations provided by FORTRAN 90), and data mapping directives to specify how large arrays should be distributed across processes. A High Performance FORTRAN preprocessor or compiler then translates the High Performance FORTRAN source code into an equivalent SPMD program with message-passing calls (if the target is a system with an underlying message- passing architecture), or with proper synchronizations (when the target system has a shared-memory architecture). The computation is distributed to the parallel processes to match the specified data distributions. This approach has the advantage of freeing the user from the need for explicitly distributing global arrays onto local arrays and changing names and indices accordingly, allocating buffers for data that must be communicated from one node to another, and inserting the required communication calls or the required synchronizations. Another advantage is that High Performance FORTRAN source code is compatible with regular FORTRAN (since syntactically, directives are comments), so that code development can occur on ordinary workstations and porting code from one processor to another is easier.

To date, the primary focus for programming large scalable parallel systems has been on the explicit message-passing and data parallel models, and our current emphasis is on the efficient support of these models. For the explicit message-passing model, we must support the prevalent and emerging message-passing libraries efficiently. For the data parallel model, we must provide High Performance FORTRAN language support. Support of these programming models and an easy-to-use program development and execution environment are critical to encourage software vendors and users to invest the effort necessary to exploit the parallel capabilities of scalable parallel systems such as SP2.

Both explicit message-passing and data parallel models encourage a fairly static (declarative) distribution of data. As programmers become more sophisticated in the use of large scalable systems, we expect that parallel numerical algorithms in many disciplines will increasingly focus on sparse, irregular data structures, and dynamic distribution of data and computation to nodes. In the future, SP2 must also support a shared-memory programming model to enable this evolution. Improvements in compiler technology and in communications hardware and software will be necessary to enable support of this model on a system with an underlying distributed memory message-passing architecture. (Note that we are making a distinction here between the underlying system architecture and the supported programming style or programming model. It should be fairly evident that with the correct software and hardware support, any of the programming models can be supported on a system with either of the underlying architectures.)

Programming models in commercial computing. "Commercial computing" is a broad term that has many different connotations. For our purposes, by commercial computing we will refer largely to online transaction processing (OLTP), database query processing, and related emerging applications such as data mining and very large information servers. Such commercial applications in the UNIX environment are largely based on a few key subsystems. Database subsystems include DB2/6000*, [3, 4] Oracle, [5] Sybase, [6] Ingres, and Informix. [7] Transaction monitors include CICS/6000*, [8] Encina, [9] and Tuxedo. [10]

Porting these few primary subsystems to run in parallel on a scalable parallel system provides the basis for enabling a host of applications that utilize these subsystems. Many commercial applications do not need to be modified to run in a parallel environment since they utilize and request services from a few key subsystems. Instead it is these subsystems that need to be enabled and optimized for parallel execution or for throughput. The reason is that for many applications, the bulk of the processor time is spent executing function in the subsystems; thus optimizing the subsystem performance is the key aspect. Only sophisticated applications that provide considerable functionality over and beyond the underlying subsystems need to be specifically modified and tuned for parallel execution on scalable parallel systems.

These key subsystems mentioned above have all been enabled to run under the UNIX operating system. They were initially developed for high-volume single-processor systems, but most have been modified to run in a multiprocessor environment. To provide performance, capacity, and availability beyond symmetric multiprocessors, these subsystems are also being enabled to run in a clustered systems environment; in this environment a separate instance of the subsystem runs on each of the systems in the cluster, and a layer of software ties these instances together to provide a single system image to higher level application software.

There are two principal clustered system programming models for parallel transaction and query processing, as illustrated in Figure 2. In the function shipping model [11] (also referred to as the sharednothing model [12]) the data are physically partitioned among the nodes in the cluster, and remote function calls are made to access remote data. In the data-sharing model, [13-15] the data are shared among the nodes of the cluster. One option is to provide a direct physical connection from all nodes to all devices storing the database (for example via multitailed devices). Alternatively, the data may be logically shared among the nodes, but are physically partitioned; in this case the remote data are shipped to a requesting node either at the database level, referred to as data shipping, [16] or at the input/output device driver level, which we refer to as virtual shared disk (VSD) [17] and further describe in a later section. The software that ties the instances together routes transactions to provide load balancing and affinity routing in the data-sharing case, or routes them based on the locality of data in the function shipping case. Complex queries are divided into individual steps that can be executed in parallel to reduce the turn-around time of a query. For the data-sharing model, a fully distributed "global lock" manager must also be provided.

The critical step to enable commercial applications to run on a scalable parallel system is to support both of these fundamental programming models efficiently and to optimize the execution of the various parallel subsystems. In this sense, the solution is actually less complex than the technical computing area in which most of the individual applications have to be individually enabled for the scalable parallel environment.

System strategy

Various system approaches to scalable computing are being pursued commercially and are being investigated in academic environments. Scalable systems available today include the AT&T 3600 [18] (formerly Teradata Corporation) and the Tandem Computers Inc. Himalaya** [19] in the commercial computing arena, and the Cray Research, Inc. T3D, [20] the Convex Computer Corporation SPP1000, [21] and the Intel Corporation Paragon [22] for scientific and technical computing. Academic research covers a broad range of different areas of investigation; these include how to improve the scalability of shared-memory multiprocessors, [23] what architecture support is required for low-latency communication and fine-grain computing, [24-27] and how to support efficient parallel programming over networked workstations. [28, 29]

In designing the SP2 as a flexible, general-purpose scalable parallel system, we followed a set of guiding principles that are discussed below. We arrived at these principles after analyzing the current technology trends in both hardware and software, and understanding the requirements in the different application areas and customer environments we expected to address.

Principle 1. A high-performance scalable parallel system must utilize standard microprocessors, packaging, and operating systems.

Major technology advances in recent years have primarily come from the workstation and distributed systems marketplace. High volumes and competitive pressures in that marketplace have prompted significant investments, resulting in significant advances being made in all aspects of the technology--processors, input/output technology, communications technology, compilers, system software, tools, and applications. It is generally accepted that microprocessor performance is doubling roughly every 18 to 24 months. This is being accomplished through a combination of superscalar designs, faster and more dense CMOS (complementary metal oxide semiconductor) technologies, architecture improvements that take advantage of the increased gate counts, and improved compiler optimizations that use these improvements. No fundamental limitations are expected over the immediate future, and processors with speeds in the hundreds of megahertz are being designed in the community. Furthermore, tightly packaged symmetric multiprocessors offer the opportunity for even greater improvements in node price/performance.

Figure 3 shows a least squares fit through the performance and price (over the past 10 years and extrapolated over the next few years) of microprocessor-technology-based processors used in MPP systems and custom-designed processors used in traditional vector supercomputers and mainframes. As the figure shows, the performance of the two is rapidly converging, while the price is diverging.It is our contention that specialized microprocessors for scalable, high-performance computing will be unable to keep up with the rate and pace at which the "commodity" microprocessors will evolve and improve. Therefore, our design approach is to "ride" the microprocessor technology curve; we will use standard components (both hardware and software) from the workstation environment as much as possible, and develop custom hardware and software only where standard technology cannot meet some unique requirements of a scalable parallel system at the desired performance levels.

Principle 2. Time-to-market with the latest technology is critical to achieving leadership performance and price performance.

The rate of technology improvements mentioned above creates both an opportunity and a challenge. Since performance and time can essentially be traded, it is imperative that the SP2 systems be able to incorporate the latest microprocessor technology very rapidly. This emphasizes the need to exploit this technology essentially as is, and place as few dependencies as possible on our technology suppliers for special features to support parallel processing. It also has an implication for the underlying system architecture; it must be flexible enough to allow rapid exploitation of the latest hardware and software technologies without requiring time-consuming enhancements or modifications. This implies a relatively loose coupling between the nodes at the operating system level.

Principle 3. Required levels of latency (small multiples of memory access time) and bandwidth (small submultiples of memory bandwidth) will require custom interconnected networks and communication subsystems over the next few years.

For parallel applications, a key determinant of performance is the process-to-process communication latency and bandwidth and the corresponding overhead on the processor for executing the communications protocol. In typical scalable parallel systems today, the sustainable pair-wise interprocessor communication bandwidth for large messages is typically several tens of megabytes per second, and the latency for short messages is of the order of a few tens of microseconds. Systems with global real shared-memory architecture can typically transfer a cache line amount of data from remote memory at even lower latencies. Significant improvements of up to an order of magnitude will be required in the future as the individual nodes improve in performance.

While several interesting "commodity" network technologies (such as Fiber Channel Standard [FCS] and Asynchronous Transfer Mode [ATM]) have recently emerged, these alternatives are optimized for a very different environment and do not provide the correct levels of latency, bandwidth, and processor overhead to meet the stringent performance requirements of parallel systems. For example, ATM networks are different from networks for scalable parallel systems in that the technology is optimized for communication between heterogeneous systems across great geographic distances, there is no guaranteed delivery or flow control in the low-level protocols, and there is no protection implemented at low levels. High-level protocols provide these functions and imply higher latencies. Interconnection networks in scalable parallel systems optimize for these functions at the lowest levels, and we believe that Principle 3 will therefore continue to be valid for low-latency as well as for high-bandwidth environments. In either case, standard network technologies with special software will support high-bandwidth communications to external devices at tens-of-microsecond latencies. This will allow scalable parallel systems to more effectively utilize network resources for a variety of tasks such as input/output, storage management, and some forms of computation.

Principle 4. The system must support a programming and execution environment identical to a standard open, distributed UNIX environment.

Figure 4 shows the full stack of software (explained in the rest of this section) that is required for enabling various technical and commercial applications to run. It is not feasible to develop unique new software for all or even the bulk of the components in the stack specifically for a scalable parallel system. Much of the software for systems management, job management, storage management, databases, and message-passing libraries exists for distributed UNIX environments. Our goal is to accommodate and depend on this software. This support provides one of the dominant "personalities" of the system and allows software written for a distributed UNIX environment and available for the underlying base node to run on the SP2 machine.

Principle 5. The system should provide a judiciously chosen set of high-performance services in areas such as the communications system, highperformance file systems, parallel libraries, parallel databases, and high-performance input/output to provide state-of-the-art execution support for supercomputing, parallel query, and high-performance transaction solutions.

Scalable parallel systems must provide a second dominant personality for the high-performance supercomputing environment. This consists of a set of high-performance services, and a development environment with tools to enable, develop, and execute new parallel applications and subsystems that cannot execute efficiently in conventional distributed system environments.

The combination of Principles 4 and 5 allows us to overcome a significant limitation of prior highly parallel solutions and dispel a commonly held misconception that massively parallel machines can provide only niche solutions. In fact, it is our contention that scalable parallel systems can provide the most general-purpose solutions. Principle 4 supports the execution of all distributed open systems software and Principle 5 at the same time provides competitive solutions for traditional MPP grand challenge (national interest) and high-performance commercial applications.

The first five principles lead us to the high-level system structure shown in Figure 5. The nodes consist of robust, high-function, high-performance RISC System/6000 processors, each running a full AIX operating system. The nodes are interconnected by a High-Performance Switch through communication adapters attached to the node input/output bus (the microchannel). For the current systems, using the microchannel as the interface to the switch subsystem was a practical decision; the standard microchannel interface allows us to rapidly introduce new node technologies into the system while achieving the target goals for latency and bandwidth.

A full AIX image on each node, together with support for standard communication protocols on the switch (i.e., Internet Protocol), provide full logical support of all standard distributed services. The core of the high-performance services on the SP2 is provided by a high-performance interconnection network, an optimized communications subsystem software and a parallel file system implemented as kernel extensions to AIX, and a parallel program development and execution environment. This system architecture allows us to achieve state-of-the- art performance, price/performance, and scalability in supercomputing environments.

Principle 6. Desired system availability can be costeffectively achieved with standard commodity components by systematically removing single points of failure that make the entire system unusable, and by providing very fast recovery from all failures.

The structure described so far consists to a large extent of commodity components that are produced for workstations rather than large system environments. In a very large system with hundreds or thousands of commodity parts, failures in the node hardware, node software, and switch will occur frequently enough so that the system must be designed to continue functioning in the presence of failures.

The distributed operating system architecture has some inherent advantages over symmetric multiprocessors. The failure of an operating system image does not have to disable the entire system since the other operating system images can continue to function. Our system approach to high availability relies on this.

This approach requires that the system be configured with sufficient replication (of hardware and software components and data), and a software infrastructure for availability is provided. This infrastructure consists of a set of availability services for failure detection, failure diagnosis, reconfiguration of the system, and invocation of recovery action. The goal of these services is to allow a system to gracefully degrade from N resources to M resources (where M < N), and reintegrate the N - M resources later in a nondisruptive manner.

It should be noted that this is merely an infrastructure. To provide real benefit to an end user, all higher level subsystems such as the program development and execution environment, job scheduling, and database and transaction subsystems must exploit the N --> M --> N infrastructure and take the appropriate recovery actions nondisruptively.

Principle 7. Selected support for a single-system image through the globalization of key resources and commands, together with a single point of control for systems management and administration, is preferred compared to a true single-system image.

At the level just above the high-availability services, the software system view is that of N AIX images, each of which manages a set of local resources and provides a set of local services. A critical design decision is the level of a single-system image to be supported. Two extreme approaches are possible: the first is to stay with the totally distributed view; the second is to implement a layer of software that makes the N images appear to be one in all respects (a true single-system image).

This is a complex decision since different environments need different views. An interactive user would generally prefer a true single-system image. On the other hand, database subsystems have been written for a distributed environment and expect to see the totally distributed view; these subsystems explicitly manage the different images for performance, and provide a single-system image at the database subsystem level. Finally, for a technical computing user, a single-system image at the source code level and at the UNIX shell level is desirable.

Since providing a true (or complete) single-system image in an efficient manner is a complex undertaking and not a critical requirement in all environments, we have taken a more pragmatic approach. There are clearly key resources (such as disks, tapes, and directory services) that need to be globally known and accessible. Similarly, there are key commands that should be globalized for ease of use. Our approach is to stage in the globalization of these selective resources and commands over time, based on the critical requirements of the applications and subsystems we expect to support. Furthermore, our approach is to provide hardware and software support for the controlling, administering, and managing of N AIX images and nodes in an SP2 system from a single point (i.e., we will also provide a single-system image at the system management level). Similarly, we will provide a single-system image at the job management level so that a user can submit a job to the system as a whole, and the job management software can automatically select and allocate the required resources to the job from the set of resources available in the system at that time.

Above the global services layer in Figure 4 are a set of subsystems that are primarily built from standard distributed systems technology and tools, with extensions or modifications where necessary.

System overview

In this section we give a brief overview of the SP2 system architecture. We focus on high-level design choices that were made and, where appropriate, the rationale behind them or the implications of those choices.

System architecture. One of the fundamental decisions in the design of a parallel system is the underlying architecture. It is generally understood that symmetric multiprocessors with centralized memory and a single copy of the operating system are not scalable to beyond a small number of processors (typically up to a maximum of around 20 today). Furthermore, the single, system-wide operating system image is a critical single point of failure; an operating system failure can result in the loss of the total system.

In order to scale to hundreds of processors today (and thousands in the future), the SP2 is structured as a distributed memory machine. In such systems, a portion of the total system memory is packaged close to each processor. Access to local memory is fast and remains constant with the size of the system, while access to remote memory is slower. Scalable distributed memory machines can have one of two underlying architectures based on how data are shared--distributed shared-memory architecture or distributed memory message-passing architecture (Figure 6).

With distributed shared-memory architecture, a single global real address space exists across the whole system. All of the physical memory is directly addressable from any node, and a node can perform a load or a store instruction to any part of the real address space. This underlying architecture has the advantage that it generally makes it easier to efficiently support a shared-memory programming model (discussed earlier in the section on design goals). Typically there is a separate operating system (or micro-kernel) image on each node, but they are not independent; the different images are tightly connected at least at the virtual memory manager level so as to present a single global real address space. In such systems, address and data coherence must be maintained in hardware (which makes the hardware complex and costly, and is a fundamental limit to performance and scalability), or in software (which adds to programming or compiler complexity and program correctness exposures, and can potentially affect performance because of conservative coherence management actions).

Alternatively, with a distributed message-passing architecture, a processor has direct access (i.e., can perform load or store operations) to only its local memory. Remote memory is not directly addressable and data are shared by explicitly sending and receiving messages. Address and data coherence across nodes is not an issue here.

The SP2 is a distributed memory message-passing machine. Two primary reasons, and a host of secondary reasons, led us to select this architecture as opposed to the alternative distributed shared- memory architecture.

In the alternative distributed shared-memory architecture, a globally shared real (and we mean real as opposed to virtual) address space implies fundamental changes in the operating system running on these nodes--primarily in the virtual memory management area, but affecting other areas of the operating system. Requirement of such fundamental changes in the operating system would have been contrary to our guiding Principles 1 and 2.

Even more important is our contention that an underlying distributed memory architecture without global real memory addressability is the correct choice for cost-effective scalable parallel systems. Such systems are inherently more scalable at the system level because they do not require tight coordination between the operating system images on the different nodes to provide common address space management and maintain address coherence; nor do they require tight coordination at the hardware level to maintain data coherence. Further, message-passing structures with loose coupling between the operating systems have inherently more availability since it is easier to localize failures to the failing node. Finally, a message-passing architecture, if properly designed, allows for significant offloading of the overhead associated with communication and overlaps it with computation, especially when one can aggregate the data to be communicated into a few large messages.

In a system with a distributed shared-memory architecture, data are shared implicitly by merely referencing them, and the referenced data from a remote memory are accessed on a cache-line-by- cache-line basis. Although a single line miss may result in a latency and overhead that is very small compared to message-passing architecture, many such line misses would be required to share the data that could have been passed with just one large message.

Programming models. Our goal with the SP2 is to support as many as possible of the dominant programming models being used today in technical and commercial parallel processing, and continue to add others over time. Although introduced earlier in the section on design goals, here we discuss what support is provided for these models in the SP2 system.

For technical computing, our initial focus is on explicit message-passing and data parallel models. Because of the underlying message-passing architecture, clearly a message-passing programming style is the preferred one for performance on the SP2. Several message-passing libraries callable from FORTRAN and C are supported on the SP2. We provide MPL, [30] which is an advanced MessagePassing Library that is tuned and optimized for the underlying communication hardware of the SP2, and will soon support the emerging Message-Passing Interface (MPI) [31] standard. We also provide the PVMe, which is IBM's optimized version of the message-passing library used in Parallel Virtual Machines (PVM). [32]

The SP2 also supports the data parallel programming model with High Performance FORTRAN. [2] The Forge90 High Performance FORTRAN preprocessor from Applied Parallel Research is available today for the SP2, where the preprocessor converts a High Performance FORTRAN program into an SPMD structure using standard FORTRAN 77 and calls to the MPL. A High Performance FORTRAN compiler for the SP2 is expected to be available in the future. As mentioned earlier, High Performance FORTRAN relieves the programmer from many of the details of data partitioning and communications. In effect, it allows the user to program within a single global name space.

For commercial computing, the function shipping model (previously described in the section on design goals) fits the underlying distributed memory message-passing architecture of the SP2. The other dominant model is the data-sharing model and is supported on the SP2 via the virtual shared disk support (that will be described in the section on system components). Many of the popular databases, on-line transaction processing, and business application subsystems have been enabled to run on the SP2. The typical user does not have to do anything to exploit the underlying parallel capabilities of the system; instead, the application subsystem takes care of decomposing the individual queries to run on multiple nodes or of routing the transaction to the correct server.

Flexible architecture. The SP2 was designed to address a diverse set of application areas and customer environments. Consequently, it was of fundamental importance that the SP2 have a flexible architecture that allowed a customer to customize the system to site-specific requirements and situations.

Figure 7 shows two views of an SP2 system, illustrating its configuration flexibility and operational flexibility.

Configuration flexibility. The SP2 system consists of from 2 to 512 POWER2 Architecture* [33] RISC System/6000 processor nodes, each with its own private memory and its own copy of the AIX operating system, interconnected by a High-Performance Switch (Figure 7A). Each SP2 system also requires a control workstation that is a separate RISC System/6000 workstation that serves as the SP2 system console.

As per Principles 1, 2, and 4, we have made no fundamental change to the base RISC System/6000 processor and the AIX operating system. This means that any of the major hardware or software options available on the base RISC System/6000 workstations can be installed on an SP2 node. Similarly, several thousand RISC System/6000 applications are available immediately to an SP2 customer.

SP2 nodes can be configured to have one of two fundamental logical personalities. Some nodes are configured as compute nodes and are used for executing user jobs. Other nodes are configured as server nodes that provide various services required to support the execution of user jobs on compute nodes; these could be integrated file servers, gateway servers for external connectivity, database servers, backup and archival servers, etc.

The requirements for node performance and configurability can be very different for compute and server nodes, and for different application areas. As a result, SP2 provides three different physical node types--thin node, thin node 2, and wide node. These nodes differ in their configurability (memory and microchannel slots), performance characteristics, price, and physical size. Each node provides an optimal price/performance point in specific environments.

SP2 is designed for maximum configuration flexibility so that a system can be tuned in a cost-effective manner to a customer's particular requirements. The system can scale up over a very wide range (2 to 512 nodes) in very small increments (one or two nodes). There can be any mix of compute and server nodes within the system. Similarly, there can be any mix of thin and wide nodes (within limits of frame configuration options). Furthermore, each SP2 node can be individually configured for memory, adapters, internal hard disk, and software. This configuration flexibility is an important distinguishing characteristic of the SP2 system compared to other MPP systems.

Operational flexibility. SP2 systems will commonly be used as enterprise-wide or department-wide servers and therefore must accommodate a mix of job types (serial and parallel, interactive and batch), and accommodate throughput as well as response time requirements of applications. With a large user community sharing the system, the system must function as a throughput engine, concurrently processing unrelated user jobs; however, when response time is important, it must be possible to usurp required system resources and apply them to one or a few large, long-running jobs.

Figure 7B shows the operational flexibility of the SP2 systems in such an environment. Using the job management software, SP2 nodes can be partitioned into different pools for different classes of work-- interactive, batch serial, and parallel. Jobs of different classes are channeled to the corresponding pool by the job manager. The nodes within a pool may be shared by several jobs, or they may be dedicated to a single application. In the figure, each small square represents a processor node. Using the job management system software, an administrator can identify different sets of processors for different types of work. For example, a set of eight can be reserved for doing interactive work. Another set of eight can be kept aside for batch serial production jobs.

The figure also shows that some nodes may be set aside as server nodes; server nodes provide specific services that are required by applications running on the compute nodes. In the figure, there is a set of six general server nodes that could be used for specific services such as file servers and/or gateway services; another set of 10 nodes has been configured as a parallel database server.

Similarly, a number of nodes may be set aside for parallel work and form the parallel pool; when a parallel job is submitted, it requests a parallel job partition with the required number of nodes. The system software locates these nodes from the parallel pool and, if the requested number of nodes is available, they are allocated to the job and the job begins execution in that partition. There can be several parallel job partitions created as long as there are nodes available in the parallel pool. In the figure there are four hypothetical partitions of eight, four, six, and eight nodes each and there are still six unallocated nodes in the parallel pool available for another parallel job. As parallel jobs complete, they return the nodes they were using to the available pool.

Parallel programs generally execute in a dedicated partition of processors; this allows the individual processes to remain loosely synchronized and thus achieve good parallel speedup. It also allows the processes to use direct user-mode communication that bypasses the operating system, thus resulting in better performance. Multiple parallel jobs can be running on the machine simultaneously, each controlling a disjoint set of the nodes. Because of the topology of the interconnection network, there are no restrictions on the size of a parallel job partition or on the topological location of the nodes in the partition; this provides significant flexibility in application design as well as in job management and resource management. This is discussed further in the section on the communication subsystem.

Input/output architecture. The SP2 system provides an input/output (I/O) subsystem that scales in performance and capacity with the computation capability of the system.

Typically, an SP2 system will be part of an enterprise-wide network, and therefore must have accessibility to network data. In fact, data required by many technical applications are so large that the data are typically stored on less expensive archival storage somewhere in the customer network, and retrieved into the on-line storage only when required by an application. Similarly, because of high-availability requirements, commercial relational databases are typically copied periodically on to similar network backup and archival systems.

External servers connect to the SP2 via intermediate gateway nodes (G in Figure 8). The connection between the gateway nodes and the enterprise servers can be via a local area network, or LAN (Ethernet, token ring, or the fiber distributed data interface FDDI), or a high-speed interface such as high-performance parallel interface (HiPPI), Fiber Channel Standard (FCS), or ATM switches. The external servers provide I/O and file service in response to requests from SP2 compute nodes through a LAN-based distributed file system such as Network File System** (NFS**), Andrew File System (AFS), or Distributed File System (DFS). By using multiple servers and multiple gateway nodes, the aggregate I/O bandwidth can be scaled.

However, the scalability of such a solution is limited and it cannot meet the more stringent requirements of parallel applications for high-bandwidth, low-latency, and low-overhead access to a single shared data object. For high-performance I/O requirements, the SP2 allows I/O and file servers to be integrated into the system by configuring some of the nodes as I/O and file servers (I/O server nodes in Figure 8.). Raw I/O capacity and bandwidth can be arbitrarily increased simply by adding more I/O server nodes. With the I/O server nodes directly on the same "attachment fabric" as compute nodes, significantly higher bandwidths are possible than over LANs. More importantly, with clients and servers connected via a reliable switch subsystem rather than LANs, there is the opportunity in the future to move to a tuned, lightweight communications protocol for I/O and file system operations; this would considerably reduce processor overhead compared to the overhead associated with protocols (such as Transmission Control Protocol/Internet Protocol, or TCP/IP, and User Datagram Protocol/Internet Protocol, or UDP/IP) normally used over LANs. Finally, the I/O server node memory can provide sufficient buffer caching for improved I/O performance.

When applications are parallelized for speed improvements, their I/O bandwidth requirement from a single file also increases proportionately. Such bandwidths cannot be satisfied by standard network file systems, even with integrated I/O server nodes; the single server and the single link to the switch ultimately limit the maximum bandwidth that can be supported to a single file. SP2 expects to provide a parallel file system (discussed later) that allows files to be distributed (optionally under user control) across multiple I/O nodes to increase the bandwidth to that file from a parallel application.

Note that the integrated I/O server node architecture logically supports the function shipping model for commercial parallel computing discussed earlier; the clients on compute nodes ship the file requests to the servers on I/O server nodes. The data-sharing model is also supported by providing global access to disks connected to individual nodes as discussed later in the section on global services.

System components

Given the overall system architecture and features described in the previous section, we now look at some of the primary system components in more detail. These components have a direct impact on the essential character of the system--scalability, flexibility, its general-purpose nature, performance and price/performance, system availability, and usability.

Processor nodes. A first-order determinant of performance for both serial and parallel jobs is the individual node performance. Our decision was to use the fastest and most robust processors available to us. This way users can get reasonable performance on their existing serial applications without ever modifying them to run in parallel.

The SP2 nodes are standard POWER2 Architecture RISC System/6000 processors. [33] They are superscalar (issuing and executing multiple instructions concurrently every cycle) rather than merely superpipelined (running the processor at very high frequency), and incorporate many sophisticated organizational techniques to achieve high sustainable performance. As previously discussed in the section on configuration flexibility, the SP2 provides three physical node types--wide node, thin node, and thin node 2. The structure for these is shown in Figure 9.

Both the wide and the thin nodes have two fixedpoint units, two floating-point units (each capable of a multiply-and-add every cycle) and an instruction and branch control unit. A 66.7 megahertz (MHz) clock speed gives the processors a peak performance of 267 million floating-point operations per second (MFLOPS). The instruction unit can decode and issue multiple instructions every cycle to keep all the units busy. This is matched with a large-capacity high-performance memory hierarchy. For example, in the SP2 wide node, the memory can be up to 2 gigabytes (GB) and has a bandwidth of 2.1 GB per second; a 256 kilobyte (KB) four-way set associative cache can supply data at the rate of four 64-bit operands per cycle to the floating-point units. As a result, within a tight loop one can effectively sustain performance equivalent to a two-pipe vector processor (two loads/stores, two floating-point multiply-and-add, index increment and conditional branch every cycle). Features such as short instruction pipelines, sophisticated branch prediction techniques, register renaming techniques, large cache, large memory, and high cache and memory bandwidth all add up to a very robust processor capable of good performance on both vectorizable as well as scalar code. As a result, the processor consistently sustains a very high percent of its peak capability. The SP2 thin nodes are similar to the wide nodes but have a less robust memory hierarchy and I/O configurability.

The robustness of the nodes, both in configurability and performance, is a primary advantage of the SP2 system. In fact, on a processor-to-processor basis, we believe that the SP2 has the most powerful node in any contemporary scalable parallel system; this contributes greatly to the overall SP2 system performance.

Communication subsystem. Because of the high request rate, communication between processes in a parallel job must be high bandwidth and low latency. Traditional AIX (or UNIX) interfaces cannot provide the required performance. As per Principles 3 and 5, SP2 provides a communication subsystem as one of the high-performance services to match the performance requirements for communications within a parallel job. Further, since the communication subsystem is the heart of the system, we paid special attention to making it reliable and able to recover from most failures automatically (and transparently to the applications).

The SP2 nodes are interconnected by a High-Performance Switch [34] designed for scalability, low latency, high bandwidth, low processor overhead, and reliable and flexible communication between the nodes.

The topology of the switch is an any-to-any packetswitched, multistage or indirect network similar to an Omega network. [35] This allows the bisection bandwidth to scale linearly with the size of the system, which is critical for system scalability. In contrast, the bisection bandwidth of direct networks (such as rings, meshes, or multidimensional toroids) increases much more slowly (or not at all, as in the case of simple rings).

A consequence of the High-Performance Switch topology is that the available bandwidth between any pair of communicating nodes remains constant irrespective of where in the topology the two nodes lie. This is not the case with direct networks. In general, the bandwidth available to a node is equal to l × k / h, where l is the link bandwidth, k is the number of links in the switch per node, and h is the average number of hops through the switch required for a communication operation. In a multistage network such as the High-Performance Switch, the average number of hops, h, and the number of links per node, k, both scale logarithmically with the number of nodes. Thus the bandwidth between any communicating pair of nodes remains constant. This lends considerable flexibility in programming and managing such systems.

In contrast, in a direct network the number of hops, h (for a random communication pattern), increases with the number of nodes, but the number of links per node, k, stays constant. As a result, the average bandwidth per pair of communicating nodes decreases. Because of this problem, programmers have an additional optimization parameter to worry about; they must carefully design the application so as to minimize h by trying to limit internode communications to close neighbors. Additionally, the resource manager and job scheduler must be designed to allocate nodes for a parallel application that are topologically in close proximity with each other, and carefully map the processes to the nodes so as to minimize the average number of hops required by the internode communication in the application. These optimizations are not critical with a topology such as that of the High-Performance Switch.

Figure 10 shows the switch structure for a 64-way SP2. The switch is designed to be scalable, with the building block being a two-staged 16 × 16 switch board, made up of 4 × 4 bidirectional crossbar switching elements. Each link is bidirectional and has a 40-megabytes-per-second bandwidth in each direction. The switch uses buffered cut-through worm-hole routing for maximizing performance. [34] In small systems (up to 64-way) only one switch board is required per 16 nodes. For the 64-way system shown in the figure, the required switch boards are packaged within the processor frames (one per frame) and connected via interframe cables to get a four-stage switch network. A frame is the "box" that houses up to 16 nodes and one switch board. Additional switch stages are required for larger systems. The extra switch boards for these additional stages are packaged in a special switch frame with up to eight switch boards per frame.

An SP2 node connects to the switch board through an intelligent Micro Channel* adapter. The adapter has an onboard microprocessor that offloads some of the work associated with moving messages between nodes. The adapter can move messages to and from processor memory directly via direct memory access (DMA), thus reducing the overhead on the processor node for message processing and significantly improving the sustainable bandwidth. It also provides protection support for secure usermode communication between nodes within a parallel partition, without requiring system calls. This allows lower application-to-application message latency by avoiding kernel calls. Message cyclic redundancy check (CRC) code generation and checking is also done by the adapter, further reducing the overhead on the SP2 node. Normal message passing between the processor node and the adapter is driven by polling to avoid the overhead of interrupt processing. The communication subsystem allows shared use of the communications system by both user and kernel tasks; both userspace and Internet Protocol traffic is concurrently supported over the switch.

The communication subsystem hardware and software are designed for reliability and transparent recovery from hard and soft failures.

The switch hardware is fully checked. Each switching element is in fact shadowed by a duplicate switching element so that any error in the switching elements is detected. Similarly, packets on all links carry CRC codes; the code is generated at one end and checked at the other to detect errors in the links.

The switch always contains at least one stage more than necessary for full connectivity. Since the basic switching element is a 4 × 4 bidirectional crossbar, this extra stage guarantees that there are at least four different paths between every pair of nodes. The redundant paths provide for recovery in the presence of failures (as well as reduce congestion in the switch). In the presence of hard errors in the switch (due to a failing switching element or link), the switch can be reinitialized and the routes between nodes regenerated to avoid the failing components.

The communication subsystem software complements the hardware capability to provide transparent recovery of lost or corrupted messages. The communication protocol supports end-to-end packet acknowledgment. For every packet sent by a source node, there is a returned acknowledgment after the packet has reached and been received by the destination node. Thus the loss of a packet will be detected by the source node. The communication subsystem software automatically resends packets if an acknowledgment is not received within a preset interval of time.

When an error is detected by the switch subsystem hardware, the switch enters the diagnostic mode, during which the cause of the error is identified. Thereafter new internode routes are generated, avoiding any failing component or link that was detected, and then the switch is brought back into run mode. At this time, the communication subsystem retransmits the packets that were lost in transit when the failing event occurred. Thus the full error-detection support of the hardware, coupled with the error recovery capability of the software, results in a very reliable and robust communication subsystem for the SP2.

Parallel environment. The SP2 AIX Parallel Environment [36] is an integrated set of components that allow a user to develop, debug, and tune parallel FORTRAN or C programs, and to initiate, monitor, and control their execution.

The application of Principles 5 and 6, and to some extent Principle 2, resulted in a decision to base the Parallel Environment as much as possible on standard AIX tools and techniques and conform to established standards. This allows us to reduce the learning time and maximize ease-of-use for customers. Most commands use familiar UNIX syntax and various AIX tools are made available for use with the Parallel Environment. Program compilation, scheduling, execution, and monitoring are done in manners familiar to UNIX programmers.

The Parallel Environment has four primary components: the Message-Passing Library, the Parallel Operating Environment, the Visualization Tool, and the Parallel Debugger.

Message-Passing Library. The parallel MessagePassing Library (MPL) [29] is an advanced communications library that supports the explicit messagepassing model for FORTRAN or C programs. It provides a rich and comprehensive set of functions and subroutines for implementing simple pair-wise communication between processes, as well as more powerful collective communications operations involving user-definable groups of processes.

The MPL is implemented to exploit the SP2 HighPerformance Switch using an optimized, lightweight user-space protocol that does not require a kernel call. Alternatively, a user can elect to have the MPL run using the TCP/IP protocol over a local area network connection or over the switch.

Parallel Operating Environment. The Parallel Operating Environment (POE) provides a user environment for initiating, monitoring, and controlling the execution of a parallel application. It can be used to (1) compile and link parallel code with message-passing libraries, (2) create a parallel partition with the required nodes (nodes can be either explicitly specified by the user or selected by the SP2 resource manager based on user-specified job requirements), (3) load the parallel job on the nodes in the parallel partition, and (4) communicate with and monitor the job while it is executing. The parallel application is controlled by a Partition Manager process created by POE on the node or workstation that was used to initiate the application.

The Visualization Tool. The Visualization Tool (VT) provides performance monitoring and trace visualization for a parallel application. Performance monitoring displays system activity in real time while the application is running. Trace visualization is a postmortem process that allows the user to view in detail the interactions between parallel processes, using traces collected during run time. VT can be used to debug an application by identifying deadlock situations and analyzing interprocess communications, and it can be used to analyze and tune a parallel application by identifying performance bottlenecks and load imbalances.

The Parallel Debugger. The Parallel Debugger is a source-level debugger with both a command line and a graphical interface. It is an enhancement of the familiar UNIX dbx debugging tool and incorporates additional functions specific to parallel program debugging.

The AIX Parallel Environment is soon expected to provide support for an optimized version of the MPI message-passing library.

Parallel file system. Standard UNIX distributed file systems (e.g., NSF, ASF, and DFS) do not satisfy the extremely high bandwidth to file data required by some data-intensive applications. A parallel file system (PFS) [37] is one of the high-performance services created for SP2 to address the unique requirements of supercomputing applications that require scalable high bandwidth to an individual file from a parallel application.

PFS supports parallel access (from an application running in parallel on multiple compute nodes) to a file that is striped (i.e., the data are distributed) across multiple PFS server nodes. It allows the user to control the physical layout of parallel files when they are created--that is, the user can specify how the data in the file are to be distributed into physical partitions across the PFS server nodes. In addition, it allows each process within a parallel application to open a different subfile (or logical file partition) within a parallel file. Thus, processes may open disjoint partitions of a parallel file, and this partitioning may be changed during program execution without changing the physical layout of the file. Processes may also share access to the same subfile (or to the entire file). Accesses to shared files are guaranteed to be atomic and serializable. A scalable serialization protocol is used that does not require locking of the file or parts of it whenever it is accessed. PFS supports files greater than 2 GB in size, and asynchronous I/O. Standard UNIX file system calls (vnode interface) are supported by PFS, and additional capabilities of PFS are made available to users via the UNIX ioctl operation.

Access to the PFS is provided by a PFS client that runs on each compute node. In addition, each PFS server node runs a PFS server. These server processes communicate among themselves and collectively form a parallel server that satisfies requests of PFS clients.

When a PFS (sub)file is opened, the PFS client is provided information about the physical layout of this file. Any subsequent access to this file results in direct communication from the client to any servers that hold parts of the accessed file, with no additional communication to a manager holding meta-data. Each write access results in exactly two messages between the client and each involved server (data are sent and an acknowledgment is received); each read access results in exactly three messages. Data are cached (buffered) only at the PFS server nodes so that no communication is needed for cache coherence. Atomicity and serializability are guaranteed by a protocol executed by the PFS server nodes; this guarantees that accesses that belong to the same transaction are seen in the same order at each server node.

Availability services. The availability services provided by the SP2 system and described below are today used by a limited and controlled set of subsystems. In the future, as the application programming interfaces and the function of these services stabilize, our intention is to make them more generally available. These services would then form a scalable infrastructure that can be used by software developers to build recoverable subsystems and servers for the SP2. As discussed earlier in the section on Principle 6, by using these services, critical subsystems can be designed to gracefully degrade from N resources to M resources (where M < N) without disruption of service; later, the N - M resources can be reintegrated into the system in a nondisruptive manner after they have been serviced.

Today, SP2 availability services provide for "heartbeat," membership, notification, and recovery coordination.

Heartbeat services allow processors in the system to be monitored for normal operation.

Membership services allow processors and client processes to be identified as belonging to some defined group; a group is generally formed to include members providing some known and related application service.

Notification services allow members of the group to be notified when membership in the group changes. Membership within a group changes as processors join or leave the group as a result of various node events, such as node restart, shutdown, or a failure.

Recovery coordination services provide a mechanism for initiating recovery procedures within the active group in reaction to node events that change the membership. These services are used to coordinate the running of recovery procedures across the members of the currently active group in response to membership changes.

These services are implemented via daemons (unattended programs set up to execute periodically or by the occurrence of an event) running on the nodes. The heartbeat daemons exchange periodic heartbeat messages to determine which nodes are up.

In the event of a failure, the surviving members of the group are notified of the event, and the specific recovery action or procedure is initiated. The recovery procedure is specified by the subsystem that is to be recovered.

Currently these services are used to varying degrees by some of the SP2 system software components (such as the virtual shared disk described later) as well as one of the third-party (supporting vendor) application subsystems. As the application programming interfaces and the functions for these services stabilize, we intend to make them more generally available. Over time we expect that all of the critical SP2 subsystems and many other third-party software will use them to provide transparent recovery.

SP2 also supports AIX High-Availability Cluster Multi-Processing/6000 (HACMP/6000*), an alternative software subsystem that allows up to eight SP2 nodes to be configured in a highly available cluster. In the past, HACMP/6000 has been available only for a cluster of RISC System/6000 workstations.

Global services. As discussed earlier in the section on Principle 7, a full single-system image is not a critical requirement for SP2; instead, much of the benefit associated with a single-system image can be derived from providing elements of this functionality as global services. Global access to specific resources such as disks, files, and communication networks is the primary requirement.

In SP2, global access to files is provided today by networked file solutions such as NFS and AFS. These provide for concurrent shared access to file data and are the basis for the globalization of this resource.

Global network access is provided via normal network routing functions and TCP/IP and UDP/IP support over the switch. In this way, SP2 nodes that are not physically attached to an external network still have the ability to communicate through nodes that are physically attached (i.e., through gateway nodes).

Global access to disks is provided by the virtual shared disk [16] support introduced earlier in the section on programming models in commercial computing. Using VSD, an application running at any SP2 node can transparently access any disk, physically located on any other SP2 node, as if it were locally attached to that node. This is done by trapping a request for a remote shared disk at the disk driver level and shipping the request to the corresponding node. In effect, VSD is a device driver layer that sits on top of the AIX Logical Volume Manager and exports a raw device interface. If the access is to a shared disk that is locally connected, the VSD layer passes the request directly on to the Logical Volume Manager on that node. If, however, the access is to a shared disk attached to a remote node, the VSD layer sends the request to the VSD on that remote node (through the switch), which in turn passes it on to the remote Logical Volume Manager for access. The response is returned to the VSD on the originating node and to the requesting application.

VSD is highly optimized for performance, both in terms of remote disk access bandwidth, and in terms of the overhead on the processor to access remote data. For instance, when a request is shipped to a remote node, causing a communication adapter interrupt at the remote node, the corresponding disk access is initiated at the interrupt level. By contrast, network file systems such as NFS need to schedule a daemon at the remote server node, and consequently have a much higher processor overhead and lower bandwidth than VSD. The aggregate bandwidth supported at a server node is about an order of magnitude higher than that for remote access for typical network file systems. Remote disk access for a single client using VSD is close to the disk bandwidth for sequential access, and slightly lower than the local access rate for random access of small data blocks.

VSD provides transparent switching to an alternative server for recovery. For this feature, each VSD server is logically paired with an alternate secondary server. The disks must be twin-tail attached to both the primary and secondary server nodes. However, only the primary tail is active normally. The heartbeat service is used by the primary and secondary nodes to monitor one another. If the secondary node detects a primary node failure, control of the disks attached to the failing node is transparently switched over to the secondary server, and requests to those disks continue to be serviced by the secondary server node transparently to the application. The SP2 availability services are used by VSD to implement this transparent recovery capability.

Currently, VSD is primarily required by database subsystems based on the data-sharing architecture that requires access to disks from any node.

Another global service is the system data repository (SDR). The SDR is a distributed data repository that provides data storage and retrieval across nodes of the SP2 and the control workstation. It contains system-wide information about the nodes and switches and their configuration, and about the jobs currently in the system. It is used by system management, hardware monitor, parallel file system, switch subsystem, and Resource Manager.

The SDR is the result of an important design decision to require that critical system data (such as job management and system management data) not only be local to the servers that implement the function; instead, the server data must be maintained externally in a system-wide repository (that being SDR). The rationale for this is twofold. First, it eliminates redundancies and inconsistencies between servers using the same data. But more importantly, it is the first step toward solving the server availability problem in large scalable parallel systems. The goal is that critical global servers (such as the Resource Manager and file systems) should be restartable. If a primary system server fails, then it should be possible to restart the server without affecting other servers and user jobs in the system in any catastrophic manner. During the period of time when the failing server is unavailable, requests to that server will cause the requester to be delayed, but there should be no other sympathetic failures caused by the server outage. The way that a failing server gets restarted is by inspecting the SDR objects that contain all the necessary information for it to bring itself back to its state prior to failure.

System management. The SP2 system management and system monitoring software provides the administrator with a single point of control for managing and monitoring the SP2 system. The underlying philosophy was to build upon many of the system management tools, commands, and interfaces already available for the AIX/6000* workstation environment. SP2 system management extensions facilitate performing traditional system management functions in a system with multiple nodes, each running its own independent copy of the operating system. These functions include system installation, system operation, user management, configuration management, file management, security management, job accounting, problem and change management, hardware monitoring and control, and print and mail services.

The single point of control is via the control workstation that acts as the system console. It is a separate RISC System/6000 workstation that connects to each SP2 frame (the nodes and switch board) in the system. By logging on to the control workstation from anywhere on the customer network, an administrator can remotely manage the entire SP2 system from a central location.

The system management software uses the UNIX distributed shell command dsh and the secure client/server system sysctl for initiating system management commands from a single point of control and have them execute in parallel on multiple nodes. Sysctl is used for executing remote commands in parallel and uses the UNIX security package Kerberos for authentication. Filters are provided to facilitate specifying which nodes to operate on and for modifying and consolidating the responses in order to make them more comprehensible.

Hardware monitoring allows the administrator to monitor and control the state of the SP2 hardware. Each node, switch board, and frame in the SP2 system has a supervisor card that provides sensing of environmental conditions and control over the hardware components. Through this facility, an administrator can power nodes and switches on and off, reset nodes or switches, change the key switch position of a node, and monitor the node displays, node and frame voltage levels, node temperature, and certain hardware failures. Information can be viewed at various levels of abstractions--single node, frame, switch or total system--all from the control workstation.

Job management. Users access the SP2 system through an external network, either in batch mode or interactive mode. In batch mode, jobs are submitted using network job scheduling software. Typically that will be LoadLeveler* [38] or NQS. [39] Currently, LoadLeveler is the only supported mechanism for submitting parallel batch jobs. For interactive access, a user logs on to the SP2 system via the standard distributed UNIX rlogin command. The logon can be directed to a specific SP2 node by using the corresponding network address. Alternatively, the SP2 host name can be specified, in which case the user is given logon access to a lightly loaded SP2 node.

The SP2 Resource Manager is a parallel job server that runs on one of the SP2 nodes and manages the allocation of nodes to parallel jobs entering the SP2 system. The nodes of an SP2 system can be divided into a serial pool and one or more parallel pools. Nodes in a parallel pool are used exclusively for running the processes of parallel jobs. Nodes in the serial pool are used to run serial jobs and to run the Partition Manager for a parallel job.

When a parallel job is initiated (either interactively or as a batch job via LoadLeveler), a Partition Manager process for that job is started on a node in the serial pool. The Partition Manager connects to the Resource Manager and requests the required number of nodes; if available, the Resource Manager returns a list of the nodes that would form the parallel partition for that job.

Once the parallel partition is established, the Partition Manager loads a copy of the executable code for the job and starts a Partition Manager daemon on each of the nodes in the partition. The Partition Manager daemon on each node then starts execution of the process. When the application ends (either normally or abnormally), the Partition Manager is responsible for cleanup and orderly termination, and for returning the nodes in the partition to the Resource Manager.

If the node on which the Resource Manager is running fails, a backup Resource Manager is automatically started on another SP2 node. Transparent recovery is accomplished by the backup Resource Manager reinstating the original state by retrieving all job and node resource data from the system data repository.

Reliability and availability features. In the design of the system we paid particular attention to minimizing the probability of catastrophic failures. A catastrophic failure is defined as a complete system failure and restart is not possible, even in degraded mode, until service is applied. In particular, the system is designed to be tolerant of single node failures; the effect of a failing node is isolated to applications using that node. Independent operating system images on each node make this possible. As previously described in the section on the communication subsystem, the SP2 communication subsystem is designed for transparent recovery from most common hard and soft errors. Support for Redundant Array of Inexpensive Disks (RAIDs), multi-tailed devices, and mirroring is provided for storage media recovery.

In the most common situations, hardware and software service or upgrade can be applied to a node or a part of the system while the rest of the system remains in operational mode. A node in an SP2 system can be powered off and disconnected from the rest of the system to perform repairs on the node, and later replaced and "powered-up" while the rest of the system is operational.

The SP2 frames are designed for reliable, remotely controlled operations with concurrent maintenance and upgrade capability. Each frame has redundant main power supplies to reduce the chance of a system outage due to a power supply failure.

Support for transparent recovery of software subsystems and servers is provided by the SP2 availability services, as described earlier. Currently these services are used by the VSD and the Resource Manager as well as one of the database subsystems. As these services stabilize and are made more generally available, we expect many of the other SP2 system software components and other third-party software subsystems to use them to provide transparent recovery.

Similarly, the system data repository (SDR) provides a convenient mechanism for externalizing critical system data from the servers that use them. This way servers can be designed to be restartable after a failure using a backup copy on another node and re-establishing the original state by retrieving the data from the SDR. Currently the SP2 Resource Manager uses this capability to provide transparent recovery. Over time we expect other critical system servers to use this capability.

Performance

The primary determinant of system performance for a parallel system is the performance capability of the two primary building blocks--the individual nodes and the communication subsystem used to interconnect the nodes. We first discuss the capabilities of these building blocks in the SP2 and then look at system-level performance.

Table 1 shows the performance of the three different SP2 node types for several common benchmarks. SPECint92 and SPECfp92 [40] are, respectively, suites of integer and floating-point code fragments defined by the System Performance Evaluation Cooperative (SPEC); these fragments run entirely out of the processor cache on typical microprocessors and therefore do not stress the memory subsystem. Linpack 100 and Linpack TPP benchmarks [41] are solutions of relatively small systems of linear equations, but typically do not fit in the processor cache and therefore represent a different aspect of technical computing application performance than SPECfp92. Thus the Linpack and SPEC results should not be compared with each other. Linpack performance of SP2 nodes is significantly higher than that of nodes used in other scalable parallel systems available today. [41] The powerful nodes with robust configuration capability are a fundamental advantage of the SP2 system and are one of the primary contributors toward the good system-level performance of the SP2, as we show later.

The primary standard benchmarks in commercial processing are benchmarks defined by the Transaction Processing Council (TPC), [40] which also publishes the results of the benchmark measurements on different systems. There are several TPC benchmarks representing different types of commercial workloads. The TPC-C benchmark is representative of database transactions in a retail operation. Again, considering single processor performance, the SP2 nodes are considerably more powerful than the nodes used in other scalable parallel systems. [41] For example, the published TPC-C measurement for the SP2 wide node (which is equivalent to the RISC System/6000 Model 590) is 726.1 tpmC (TPC-C transactions per minute) with a corresponding $/tpmC of 1395.

Generally speaking, the SP2 wide node provides the highest performance in most environments; the SP2 thin node 2 is typically within 10 percent of the wide node performance, while the SP2 thin node is typically 20 to 30 percent lower. In environments where performance is gated by the amount of node memory and I/O configurability, the wide node generally provides the best solution; otherwise, the thin nodes generally provide better price/performance.

The other primary performance determinant is the communication subsystem. Table 2 shows latency and sustainable bandwidth measurements using different SP2 nodes. The latency measures the time to send a zero-byte message from one process space to another on a remote node; it accounts for all hardware and software path lengths between the sending application process and the receiving process. The point-to-point bandwidth measures the sustainable rate at which a process running on one node can transfer data to another application process running on another node; it assumes very large messages, one-way data transfer, and accounts for all hardware and software overheads involved in the transfer path. Finally, the exchange bandwidth measures the sustainable rate at which two application processes running on two different processors can exchange data in a simultaneous send and receive (two-way transfers). The performance is measured using the IBM Message-Passing Library (MPL), both with user-space communication over the switch and also through UDP/IP over the switch. Both paths are simultaneously available to users and, depending on the application, either or both may be used.

The sustainable pair-wise bandwidth of the SP2 is comparable to that of other scalable parallel systems available today. Many of the scalable parallel systems available today provide higher switch link bandwidths than SP2, but when the switch topology effects, message payload capability, and other system features are taken into consideration, the sustainable bandwidth between random pairs of processors in the system is comparable to that of the SP2 system.

However, compared to distributed shared-memory systems (such as the Cray T3D), the latency of the SP2 system is higher. This can affect parallel application performance adversely if the application is structured for frequent fine-grained communication. But it is important to remember that distributed shared-memory systems access data synchronously and in units of cache line size. In contrast, a distributed message-passing system can transfer data asynchronously and in very large units. In situations where the data to be transferred can in fact be aggregated into one large message, the longer latency of the message-passing machine is not an issue, and the asynchronous nature of the transfer can be an advantage.

For parallel application execution capability, there are two standard benchmarks that are widely used today, namely the Linpack HPC and the NAS Parallel Benchmark suite.

The Linpack HPC is functionally similar to the other Linpack benchmarks discussed earlier, but it does not specify any problem size; instead, it allows the problem size to grow with the size of the system under test. Its performance is a function of the floating-point capability of the nodes, the memory size, and the interprocessor communication performance of the system. Figure 11 shows that for an equivalent number of nodes, the SP2 delivers significantly higher performance than other scalable parallel systems that have reported the Linpack HPC performance. [41] Other systems shown are the Cray T3D, the Thinking Machines CM5**, and the Intel Paragon.

The NAS Parallel Benchmark suite, developed by the Numerical Aerodynamic Simulation (NAS) project at the National Aeronautics and Space Administration (NASA) Ames Research Center, [42] consists of five kernels and three pseudoapplications that collectively represent the NASA/Ames scientific computing workload. The kernels have simple data structures and each represents a fundamental computational structure in NASA's aerodynamic simulation applications. The simulated applications have multiple interacting data structures, and are more closely related to the full applications in use by NASA scientists. The kernels address specific components of system performance, while the simulated applications examine the overall system performance that is attained by combination of the components.

Figure 12 shows the performance of the SP2 system (using wide nodes with a fully configured memory subsystem) using the Class B NAS Parallel Benchmarks, compared to some other massively parallel systems, as reported in Reference 43 The results are shown in Cray C90 single processor equivalents, and are for 128-node systems (except for three cases on the Intel Paragon where data are only available for a different number of processors, as noted in the figure). The SP2 significantly outperforms other equivalent size systems, especially on the pseudoapplications that are closely related to full applications in use by NASA scientists. The implementations of the NAS benchmarks on the SP2 are described in depth in References 44 and 45.

To characterize the SP2 relative to commercial computing requirements, we started with intrinsic database operations for decision support environments and have tested their scalability on SP2 systems with increasing numbers of nodes. For these measurements we have used the Wisconsin Benchmark [40] that tests the following basic operations on a 1 GB database: (1) load database, (2) create index, (3) scan database and aggregate result, and (4) sort, merge, and join.

It is important to realize that when evaluating the performance of the SP2 system in this context, the performance of the database subsystem itself is as significant as that of the hardware platform. The two database subsystems that have been ported for parallel execution on the SP2 are the Oracle Parallel Server (Version 7.1) and the DB2* Parallel Edition. The elapsed time for these operations on SP2 with DB2 Parallel Edition is shown in Figure 13 and shows generally linear speedup with good parallel efficiency. Measurements for Oracle 7.1 will be available in the future. We also expect to do measurements on TPC-D (decision support) and TPC-C (on-line transaction processing) benchmarks defined by the Transaction Processing Council.

Key future challenges

We expect that the future evolution of SP2 systems will continue to be guided by the principles that we articulated in the section on system strategy. In particular, we will continue to exploit workstation technology as much as possible. However, as the market for the scalable parallel systems expands, we will continue to assess technologies that are specifically targeted to scalable parallel computing.

A symmetric multiprocessor as the basic node in a scalable parallel system is an attractive option for future systems. It can provide significant improvement in peak node performance at a cost that is not much higher than the cost of a uniprocessor with the same amount of storage. However, effective use of this performance is a challenging problem. In particular, one needs to manage two levels of parallelism, with big differences in performance and function between intranode communication and internode communication. Hiding the complexity of two levels of parallelism by providing transparent compiler exploitation will be crucial.

The communication subsystem in scalable parallel systems must continue to improve in performance and capability. Efficient support of a sharedmemory programming model requires aggressive bandwidth increase and latency reduction. This precludes interfacing to the switch via the I/O bus in future SP2 systems. Direct connection to an internal memory bus will lead to improved performance, but will require a more specialized package and interlocking of adapter technology and processor technology.

Aggressive latency reduction will require a communication architecture that will allow a processor to access remote memory without involving the remote processor. Such functionality is already available in one form or another on some systems. However, current implementations either compromise on performance (latency), or require a global real address space. The challenge is to provide such functionality with very low latency on a virtual memory system with minimal impact on system interfaces, while still maintaining the distributed message-passing system model we have adopted for scalability and availability.

Very large and scalable information servers (including video and multimedia servers) require the sharing and movement of massive amounts of data among nodes. A key metric here is the processor overhead for moving data. Minimizing processor overhead and supporting large throughput, rather than low latency, is critical.

There is limited experience today with scalable storage systems and scalable file systems, and hence limited knowledge to guide system architects in the design of future parallel I/O subsystems. Yet low-cost, scalable I/O is key to the success of scalable parallel computing. The High-Performance Storage Subsystem (HPSS) initiative (a cooperative effort between key government labs and several vendors of high-performance systems) has generated solutions to some of the key issues in this area. The challenge is to quickly incorporate these solutions into mainline products.

Efficient utilization of future scalable parallel computers will require major advances in software technology. Efficient compiler technology is needed to support more convenient programming models while preserving portability to other platforms. Future systems will have to support efficient time sharing of processors by parallel jobs, and preemptive scheduling of parallel computations. This requires a mechanism for coordinated dynamic allocation of processors and memory to the processes that constitute a parallel job; in particular, support for gang (simultaneous) scheduling is required. Also, users should be able to control the execution of parallel jobs (e.g., via a parallel shell), and connect parallel computations (e.g., via parallel pipes), with the same convenience with which they control or connect individual processes today.

The availability of parallel application codes and subsystems is critical for the success of scalable parallel computing. The development of such codes is hampered not only by the difficulties of parallel programming, but also by the lack of accepted standards in this arena. High Performance FORTRAN and the Message-Passing Interface (MPI) are two successful examples of establishing such (de facto) standards. Much more is needed. In particular, standards are needed in the area of parallel system services, such as parallel I/O.

Another aspect, particularly in commercial applications, is that of providing coordinated and consistent recovery between a large number of subsystems running on any scalable parallel system. These could include multiple database systems, a transaction monitor, a parallel file system, a multimedia server, resource managers, the operating system, the global I/O subsystem, etc. Each subsystem may detect several software failure conditions on its own, and these need to be combined with other failure detection mechanisms (such as the heartbeat service), for overall diagnosis. Ordering, synchronization, and parallelism during recovery of the various subsystems is needed. Providing a high-availability infrastructure that addresses these issues is a challenging problem.

Meanwhile, the software technology used in sequential systems is progressing. Object-oriented (OO) programming and object-oriented storage (object-oriented databases, object-oriented file systems) are becoming more prevalent; operating system technology is evolving to better support OO. Future scalable parallel systems will have to cope with this evolution, both because of its impact on the basic uniprocessor software technology, and because of the importance of OO for future applications. Scalable OO technologies are yet to emerge from the research community. Yet such technologies are likely to have a major impact on the future of scalable parallel systems.

Summary

In this paper we described the seven guiding principles that form the basis for the SP2 system architecture, and how these influenced the system overview and system components. Our contention was that these principles should let us bring to market scalable parallel solutions in a timely manner for a wide range of applications. This has largely been borne out by the success and broad-based market acceptance SP1 and SP2 systems have enjoyed since the SP1 systems became generally available. Around 400 SP1 and SP2 systems (with 2 nodes all the way up to 512 nodes) are today being used productively in many different areas, including computational chemistry, crash analysis, electronic design analysis, seismic analysis, reservoir modeling, decision support, data analysis, on-line transaction processing, local area network consolidation, and as workgroup servers. They are being used in diverse industry areas including manufacturing, process, distribution, transportation, petroleum, communications, utilities, education, government, finance, insurance, and travel.

Time-to-market with the latest technology has been one of the keys to the success of the SP1 and SP2. The SP1 system was introduced in the market roughly one year after the POWERparallel system* program was started. This was accomplished through Principles 1, 2, and 3 by using many standard RISC System/6000 hardware and software components and leveraging interprocessor communication technology developed in IBM Research. The SP2 was delivered less than a year after the SP1. The loosely coupled system structure allowed significant enhancements in the communication protocols and a very fast introduction of the new POWER2 microprocessor.

Another key to success has been the availability of key applications. A wide range of IBM and vendor applications and subsystems run in parallel on the SP2. These applications span a spectrum of areas: computational chemistry and pharmaceuticals, engineering analysis, electronics analysis, and petroleum explorations and production in the technical computing area; database management systems, transaction processing monitors, and business applications systems that run on top of database subsystems, for the commercial computing area; and general tools for system management, network management, and storage management. Most are currently available.

Principles 1, 4, 5, and 7 are the primary reasons behind the broad portfolio of parallel applications. Support of Principles 1 and 4 also contributes to the several thousands of RISC System/6000 applications available on the SP2. The support for standard distributed AIX program development and execution environment, key programming models, standard message-passing libraries, globalization of key resources, and availability services has allowed independent software vendors to modify key applications and subsystems to run on the SP2 without diverting from their mainstream development strategies.

Our strategy of using standard microprocessors and a mature operating system, complemented with a set of high-performance services has provided other benefits. The system provides industryleading performance as shown by most key benchmark results. The system installs easily and exhibits good availability characteristics. The flexible system architecture and a variety of applications allow users to start doing real work immediately. For example, a large 400-node SP2 has been installed at the Maui High-Performance Computing Center. Within a few weeks of delivery, Professor Frank L. Gilfeather, one of the principal investigators, provided the following testimonial: "We are seeing availability (considering down time from all causes) of 98 percent and are achieving 70-90 percent CPU utilization. Compared to other MPP machines, the SP2 has set a new standard in ease of installation and use, availability, reliability, and cost/performance."

This broad-based acceptance attests to the flexible and general-purpose nature of the architecture and validates the principles we used to design the SP2 to effectively address the requirements of a wide range of applications and industries.

Acknowledgments

Ray Bryant, Daniel Frye, Kevin Gildea, and Don Grice were part of the SP2 systems team and made significant contributions to the SP2 systems architecture. Christos Polyzois and Jim Rymarczyk were instrumental in the definition of SP2 architecture enhancements for commercial computing.

The overall success of SP2 is the result of contributions of many other people in all the various phases in the development and delivery of a system--system architecture and system design, hardware and software design and development of the different system components, system test, manufacturing, applications enabling, performance benchmarking, marketing support, and service. They are too numerous to acknowledge them all individually here but their contributions were critical.

Much of the success can also be attributed to the close working relationship between the IBM POWER Parallel Division and the IBM Research Division, and others from the RISC System/6000 hardware group, the former Federal Systems Marketing Division, IBM Software Solutions, and the European Center for Scientific/Engineering Computing. Joint partnerships with some of our key customers helped guide and validate many of our architecture and design decisions.

*Trademark or registered trademark of International Business Machines Corporation.

**Trademark or registered trademark of X/Open Co. Ltd., Tandem Computers Incorporated, Sun Microsystems, Inc., or Thinking Machines Corporation.

Cited references

Accepted for publication February 21, 1995.