Parallel and clustered systems initially found in numerically
intensive markets are gaining increasing acceptance in commercial
segments as well. The architectural elements of these systems span a
broad spectrum that includes massively parallel processors that focus
on high performance for numerically intensive workloads [1] and cluster
operating systems that deliver high system availability. [2] This paper
describes new clustering functions that are implemented by IBM's S/390*
processors and OS/390* operating system.
The S/390 cluster (parallel system complex, or
Parallel Sysplex*) contains innovative multisystem data-sharing
technology, allowing direct, concurrent read/write access to shared
data from all processing nodes in the parallel configuration, without
sacrificing performance or data integrity. Each node is able to
concurrently cache shared data in local processor memory through
hardware-assisted cluster-wide serialization and coherency controls.
This in turn enables work requests associated with a single workload,
such as business transactions or database queries, to be dynamically
distributed for parallel execution on nodes in the sysplex cluster,
based on available processor capacity. Through this state-of-the-art
cluster technology, the power of multiple OS/390
systems can be harnessed to work in concert on common workloads, taking
the commercial strengths of the OS/390 platform to
improved levels of competitive price/performance, scalable growth, and
continuous availability.
In this paper we review the S/390 Parallel Sysplex
architecture, its core technology components, and the customer business
objectives that shaped the overall system structure. In Part I we
discuss the objectives that guided the Parallel Sysplex designers and
introduce the technology components of the Parallel Sysplex cluster.
Part II presents an overview of the coupling facility
(CF) and coupling support facility architectures,
and discusses the scalability of the S/390 Parallel
Sysplex. A concluding section summarizes the key points contained in
the paper.
PART I
Design overview
This section summarizes the key design points for the
S/390 Parallel Sysplex and relevant design
rationale. We begin with a set of objectives that guided the overall
system structure. This is followed by a description of design benefits
derived from its data-sharing capabilities. Then, alternative cluster
architecture models are discussed. Finally, Parallel Sysplex technology
functions are introduced.
Customer business objectives. One key customer business
objective was to reduce the total cost of computing for
S/390 systems. There are many examples of systems
that use low-cost microprocessors as a building block for a large
system. Obtaining the same cost advantages as these systems required a
dramatic change for S/390: replacing the S/390 bipolar processor
technology with complementary metal-oxide semiconductor (CMOS)
microprocessor technology and clustering multiple systems together to
meet aggregate capacity requirements. This strategic decision
enabled the S/390 systems to leverage
industry-standard CMOS technology for
price/performance advantage, both in terms of reduced base
manufacturing costs and significant ongoing customer savings in reduced
power, cooling, and floor space requirements.
A closely related objective was to provide a commercial platform that
would support the nondisruptive addition of scalable processing
capacity, in increments matching the growth of workload requirements
for customers, without requiring re-engineering of customer
applications or repartitioning of databases. Satisfying this
objective was critical to the design of the Parallel Sysplex
shared-data cluster architecture, which will be discussed later in this
paper. Prior to Parallel Sysplex, S/390 customers
had been forced to contain the capacity requirements of a workload
within the technology limits imposed by the size of the largest single
symmetric multiprocessor system available. Workload growth beyond these
limits required splitting the workload and repartitioning the database
between the nodes--a complex, resource-intensive process not supportive
of customer business objectives.
A third key business objective was to address the increasing customer
demands for improved application availability, not only in terms of
failure recovery but also, more importantly, in terms of reduced planned
outage times. Today, there is less opportunity for planned systems
shutdowns in the global economic environment. Here again, meeting this
objective was key to the Parallel Sysplex cluster design.
Another key business objective was to protect investments customers
have in existing applications. There were two aspects to this
objective. First, the Parallel Sysplex technology had to be introduced
in a compatible manner with existing applications. Second, the benefits
of parallel processing had to be transparently applied to applications
through exploitation of the technology by application subsystems and
database managers. With few exceptions, these objectives have
been met. The Parallel Sysplex technology extensions to the
S/390 architecture (introducing new
CPU instructions, new channel subsystem technology,
etc.) are fully compatible with the base S/390
architecture. The IBM subsystem transaction managers
in the Customer Information Control System (CICS*)
and the Information Management System (IMS*), and
the key subsystem database managers such as DATABASE
2* (DB2*) and IMS-DB,
have exploited the data-sharing technology while preserving their
existing interfaces.
A final objective was to logically present a single-system image to
users, applications, and the network, and to provide a single point of
control to the systems operations staff. Meeting this objective was key
to controlling the overall cost of managing a multisystem
configuration. In a Parallel Sysplex environment, many cluster
technology components, both hardware and software, have been developed
to meet this objective. New data-sharing technology hardware enables
multiple-system nodes to serve common workloads with the appearance of
a single large computing resource. Base operating system cluster
services [3-5] provide robust
intersystem communication,
system monitoring, and automatic failure takeover mechanisms. Shared
consoles are provided for managing multiple operating systems and
multiple underlying hardware system nodes with a single point of
control. Key system profiles, catalogs, and other resources can be
shared across the clustered systems to enable efficient "cloning"
of system definitions. Through these and other means, systems
management costs do not increase linearly as a function of the number
of systems in the sysplex. Rather, total cost of computing efficiencies
of scale accrue through the coordinated management facilities of the
Parallel Sysplex cluster.
Data-sharing design benefits. Given the customer business
objectives outlined above, the Parallel Sysplex shared-data
architecture and technology were critical to delivering the following
system benefits: dynamic workload balancing, continuous availability,
and continuous operations.
Dynamic workload balancing. A key aspect of being responsive
to changing business needs in a commercial parallel processing
environment involves the ability to dynamically adjust system resources
to best satisfy workload performance objectives in terms of throughput
and response times. In the S/390 Parallel Sysplex
environment, the high-performance data-sharing technology provides the
means for OS/390 and its subsystems to support
dynamic workload balancing across the collection of systems in the
configuration. Functionally, workload balancing can occur at two
levels. During initial connection to the cluster, clients can be
dynamically distributed and bound to server instances across the set of
cluster nodes to effectively spread the workload. Subsequently, work
requests submitted by a given client (such as transactions) can be
executed on any system in the cluster based on available processing
capacity. The work requests do not have to be directed to a specific
system node due to data-to-processor affinity, which is typically the
case with alternative data-partitioning parallel systems, wherein
buffer coherency and serialization controls are not cluster-wide in
scope. In a Parallel Sysplex cluster environment, work will normally
execute on the system on which the request is received, but in cases of
"over-utilization" on a given node, work can be directed for
execution on other less-utilized system nodes in the cluster. For both
on-line transaction processing (OLTP) and
decision-support workloads, dynamic workload balancing across systems
can be made transparent to the customer applications or users.
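The second level of balancing described above can be sketched as a simple routing decision. This is a minimal illustration and not the OS/390 workload manager logic; the `route` function, the node names, the utilization figures, and the 0.85 threshold are all assumptions made for the example.

```python
def route(receiving_node, utilization, threshold=0.85):
    """Pick the node that should execute a work request.

    utilization maps node name -> fraction of processor capacity in use.
    threshold is a hypothetical "over-utilization" cutoff.
    """
    # Normal case: run the work on the system that received the request.
    if utilization[receiving_node] <= threshold:
        return receiving_node
    # Over-utilized: direct the work to the least-utilized node.
    # Any node is eligible because the data is shared, not partitioned.
    return min(utilization, key=utilization.get)


nodes = {"SYSA": 0.95, "SYSB": 0.40, "SYSC": 0.70}
print(route("SYSC", nodes))  # stays on the receiving node
print(route("SYSA", nodes))  # redirected to a less-utilized node
```

The point of the sketch is that the routing decision depends only on capacity. In a data-partitioning system the candidate set would first have to be restricted to the node that owns the data.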
Continuous availability. Within a Parallel Sysplex cluster it
is possible to construct a parallel processing environment with no
single points of failure. Parallel Sysplex hardware components such as
sysplex timers and coupling facilities (to be discussed in detail
later) can be redundantly configured. The sysplex timer serves as a
common time reference source for systems in the sysplex, distributing
synchronizing clock signals to all nodes. The coupling facility
(CF) is the key Parallel Sysplex technology
component providing state-of-the-art cluster data-sharing functions. If
a coupling facility fails, critical data contents can be "rebuilt"
into an alternate CF under OS/390
system and subsystem control. Since all systems in the Parallel Sysplex
can have concurrent access to all critical applications and data, the
loss of a system due to either hardware or software failure does not
necessitate loss of application availability. Peer instances of a
failing subsystem executing on remaining healthy system nodes can take
over recovery responsibility for resources held by the failing
instance. Alternatively, the failing subsystem can be automatically
restarted on still-healthy systems using automatic restart capabilities
to perform recovery for work in progress at the time of the failure.
While the failing subsystem instance is unavailable, new work requests
can be redirected to other data-sharing instances of the subsystem on
other cluster nodes to provide continuous application availability
across the failure and subsequent recovery.
Continuous operations. The same availability characteristics
associated with handling unscheduled outages are applicable to planned
outages as well. A system can be removed from the Parallel Sysplex for
planned hardware or software reconfiguration, maintenance, or upgrade.
New work can be dynamically redistributed across the remaining set of
active systems. Once the system is ready to be brought back on line, it
can be reintroduced into the sysplex in a nondisruptive manner and
participate in dynamic workload balancing as described earlier.
New system nodes can be introduced into the Parallel Sysplex in a
similar fashion. That is, the already-running systems continue to
execute work concurrent with the activation of the new system node.
Once the new system is active, it can become a full participant in
dynamic workload balancing. New work requests are naturally driven at
an increased rate to that system until its utilization has reached
steady state with respect to the demand for overall processor resources
across all system nodes in the Parallel Sysplex. This capability
eliminates the need to shut down the entire cluster to repartition the
databases and retune workloads for each system to distribute work
evenly after introduction of the new system into the configuration, as
is typically required with a data-partitioning parallel processing
system.
A further design objective for the Parallel Sysplex was for new
releases of OS/390 and its key subsystems to support migration
coexistence between the current release and the next. This allows new
software product release levels to be rolled through the Parallel
Sysplex one system at a time, providing continuous application
availability across the systematic migration install process.
Cluster architecture models. Clustering, as a way of
organizing computer systems, was surveyed by Pfister [6], who
identified a cluster as "a type of parallel or distributed system
that consists of a collection of interconnected whole computers and is
utilized as a single, unified computing resource." The individual
cluster nodes can be either uniprocessor or symmetric multiprocessor
(SMP) systems. Although the computers may be
connected by a high-speed communication mechanism, they do not share
any central (main) storage.
Another viewpoint [7,8] classifies
parallel systems based on
conformance to one of the following architecture models, each having
its own strengths and weaknesses: the shared-nothing model, the
shared-disk model, and the shared-everything model.
The shared-nothing (data-partitioning) model. Each system owns
a portion of the database, and each portion can only be read or
modified by the owning system. Data partitioning enables each system to
locally cache its portion of the database in processor memory without
requiring cross-system communication to provide data access concurrency
and coherency controls. Scalability characteristics are excellent with
this approach.
However, there are limitations imposed in a commercial processing
environment by such a design point. [9,10]
Significant
capacity planning skills and cost are required to tune the overall
system to match the processing capacity for each cluster node with the
projected workload access rate to data owned by that node. Real-time
workload demand fluctuations can over- or under-utilize processor
resources. Repartitioning of the cluster databases to introduce new
cluster nodes for additional capacity requires the entire cluster to be
shut down.
The shared-disk (shared-data) model. All of the disks
containing databases are accessible by all of the systems. The basic
strength of this approach is that it allows a workload to be
dynamically balanced across nodes of a cluster, which also has
potential benefits for availability and continuous operations, as
discussed earlier. However, the major drawback to shared-data models
prior to the Parallel Sysplex architecture has been poor scalability
characteristics.
In shared-data configurations, distributed lock management protocols
are employed to provide concurrency (serialization) controls across the
cluster, generally involving message passing between the systems
on mainline paths to obtain and release locks. This is necessary
to ensure that only one system is allowed to modify a given shared-data
item at a time. Global (cluster-wide) buffer coherency controls are
required in order to ensure that the currency of data items cached in
local buffers in the local processor memory for each system can be
determined prior to buffer reuse.
One approach to a shared-disk architecture employs broadcast-invalidate
mechanisms to provide coherency control, sending
cross-invalidate signals to all other nodes whenever a system updates a
copy of a shared-data item locally. This is done to inform the other
nodes that their locally cached copy of the shared-data item is now
"down level." This approach scales poorly as the number of nodes in
the cluster increases. An alternative approach avoids the
broadcast-invalidate protocol, by continuing to hold the lock on a
valid locally cached data item after the transaction ends. This allows
the cached copy to be subsequently reused locally with integrity.
Ownership of the lock is released only in the presence of contention
from other systems. However, with this approach, only one system can
maintain a current local cache copy of a given data item in memory at a
time, that is, while the lock on that data item is held. Ownership of
the current data item copy transfers or "pings" from one system to
another as references to those data are made.
Regardless of the global coherency protocols used, these cross-system
"ping" effects occur whenever a system determines that it does not
have a current copy of a needed shared-data item. This typically
results in the data being pushed out to shared disk by the system in
the cluster owning a current copy, where the data are then fetched by
the requesting system node. These multisystem data transfer
I/Os can cause significant performance degradation
in the cluster if a high degree of multisystem interest in the shared
data is present.
The shared-everything model. Central storage, as well as
disks, are shared by all of the processors. This approach is used in
structuring an SMP. An SMP is not
a clustered system by itself, but can serve as the system building
block for individual nodes of a cluster. Shared-everything
architectures have processing efficiency advantages when applied across
a relatively small number of processors, but do not generally scale
well as the number of processors increases. Also, single points of
failure compromise the availability characteristics of the processing
system.
A more detailed comparison of alternative cluster architectures with
respect to performance and scalability is given in
Reference 10.
Parallel Sysplex cluster technology. The
S/390 Parallel Sysplex architecture is generally
characterized as a shared-data model. Its fundamental
distinguishing characteristic over traditional shared-disk
architectures is that the Parallel Sysplex technology enables multiple
systems to cache the same data concurrently in local processor memory
with full read/write access control and globally managed cache
coherency, with high performance and near-linear
scalability.
Specialized hardware and software cluster technology is introduced to
address the fundamental performance obstacles that have traditionally
plagued data-sharing parallel-processing systems. The core hardware
technologies are embodied in the CF (for data
sharing) and the coupling support facility (for communication between
processors and the CF) components of the system and
are discussed in detail later in this paper. Some of the most critical
functions provided are outlined below:
- Hardware-assisted global concurrency controls. Specialized hardware
is provided to support low-overhead, fine-grained global lock
management with hardware-assisted lock contention detection. In the
absence of lock contention, locks can be efficiently granted and
released without intersystem software message passing.
- Hardware-assisted global buffer coherency controls. The
CF and coupling support facilities combine to track
the locally cached shared-data items for each system, providing
low-overhead mechanisms for global buffer cross-invalidation. The
cross-invalidate operations do not involve software message passing,
nor do they interfere with normal processor instruction execution.
Cross-invalidate signals are only sent to nodes with registered
interest in a data item being updated--not broadcast to all nodes in
the cluster. Further, local buffer coherency can be checked by the
program buffer managers on each node via new CPU
instructions that access local processor memory.
- Synchronous locking and buffer coherency request handling. High-speed,
low-latency links using streamlined protocols are provided, allowing
locking, caching, and queuing operations directed to a
CF to generally be completed synchronously with respect to
CPU instruction execution. That is, in certain cases, delaying further
CPU instruction processing while the
CF executes an operation consumes
fewer machine cycles than allowing
CPU instruction processing to continue while the
CF executes the operation asynchronously,
because asynchronous execution forces the software to perform a task
switch to suspend and later resume the requesting unit of work (after
the CF
completes the operation). This can be contrasted with the intersystem
software message-passing costs to obtain and release a lock in typical
distributed software lock management protocols, or with the several
milliseconds that are required for a typical disk operation.
- Global shared-buffer cache. The CF has its own
processor memory that can serve as a global cache to enable high-speed
local buffer refresh following a local cache miss. The operation to
retrieve data from the coupling facility can be performed
CPU-synchronously if the requested data item is up
to 4 kilobytes in size. Data transfers of up to 64 kilobytes are
performed asynchronously to the initiating CPU. In either case,
the cost of a disk I/O or intersystem message
passing to "ping" ownership of the data item from one system to
another is avoided when the data are resident in the coupling facility.
- Hardware-assisted shared queuing constructs. The CF
supports general-purpose data-sharing queuing functions that are
applicable for a wide range of cluster-wide uses, including workload
distribution, intersystem message passing, and the maintenance of
shared control block state information.
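The buffer coherency controls outlined in the list above can be illustrated with a small model. This is a sketch under stated assumptions, not the CF command set: the class names and the `register`, `write`, and `is_valid` methods are hypothetical, a Python list stands in for the locally tested validity bits, and the choice to drop a node's registration once it has been invalidated is an assumption of the model.

```python
class Node:
    """A system node with a local vector of buffer-validity bits."""

    def __init__(self, name, slots):
        self.name = name
        self.valid = [False] * slots  # stands in for locally tested bits

    def is_valid(self, slot):
        # Local check only: no message to the facility is needed.
        return self.valid[slot]


class CacheDirectory:
    """Tracks which nodes have registered interest in each data item."""

    def __init__(self):
        self.interest = {}  # item -> {node: local buffer slot}

    def register(self, node, item, slot):
        self.interest.setdefault(item, {})[node] = slot
        node.valid[slot] = True

    def write(self, writer, item):
        """Invalidate other registered copies; return the nodes signaled."""
        signaled = []
        for node, slot in list(self.interest.get(item, {}).items()):
            if node is not writer:
                node.valid[slot] = False       # cross-invalidate signal
                del self.interest[item][node]  # assumed: must re-register
                signaled.append(node.name)
        return signaled


a, b, c = Node("SYSA", 4), Node("SYSB", 4), Node("SYSC", 4)
directory = CacheDirectory()
directory.register(a, "page-17", 0)
directory.register(b, "page-17", 2)
signaled = directory.write(a, "page-17")
print(signaled)                       # SYSC is never signaled
print(a.is_valid(0), b.is_valid(2))
```

In the sketch, SYSC never registered interest in the item, so it receives no signal. This is the contrast with a broadcast-invalidate protocol, where every node would be signaled on every update.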
S/390 Parallel Sysplex cluster
This section provides an overview of the technical capabilities of
the S/390 Parallel Sysplex. It covers the overall
system structure, the basic operating system support for parallel
processing, and the advanced technology introduced to enable efficient
clustering or "coupling" of system nodes.
System model. An S/390 Parallel
Sysplex [11,12]
is a cluster of interconnected processing
nodes with attachments to shared storage devices, network controllers,
and core cluster technology components, consisting of coupling
facilities, coupling support facilities, and sysplex timers. (See
Figure 1.) A coupling facility (CF)
enables high-performance read/write sharing of data by applications
running on each node of the cluster through global locking and cache
coherency management mechanisms. It also provides cluster-wide queuing
mechanisms for workload distribution and message passing between nodes.
Another component, a coupling support facility, resides on each of the
processing nodes and is responsible for communications between the
nodes and the coupling facility. A sysplex timer serves as a common
time reference source for systems in the sysplex, distributing
synchronizing clock signals to all nodes. This enables local processor
time stamps to be used reliably on each node and synchronized with
respect to all other cluster nodes, without requiring any software
serialization or message passing to maintain global time consistency.
The synchronized time reference source facilitates real-time
or post-processing merges of transaction manager logs across systems,
for example, to provide coordinated transaction and database recovery
across the cluster for a shared workload.
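The log merge mentioned above can be sketched in a few lines. The log records, system names, and time-stamp values below are invented for illustration; the only property the sketch relies on is that time stamps taken on different nodes are directly comparable.

```python
import heapq

# Hypothetical per-system transaction logs: (time stamp, system, record).
# Each log is already ordered by its own local time stamps.
log_sysa = [(100, "SYSA", "begin T1"), (130, "SYSA", "commit T1")]
log_sysb = [(110, "SYSB", "begin T2"), (120, "SYSB", "update row 7")]

# With a common time reference, entries from different systems can be
# interleaved by time stamp alone, with no cross-system coordination.
merged = list(heapq.merge(log_sysa, log_sysb))
for entry in merged:
    print(entry)
```

Without a common time reference, the per-system time stamps could not be compared, and the merge would need some other ordering mechanism such as software message passing.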
Figure 1
The Parallel Sysplex currently supports up to 32 processing nodes where
each node is a symmetric multiprocessor containing between 1 and 10
processors. The nodes do not have to be homogeneous; that is, mixed
configurations supporting both S/390 CMOS processor
systems and traditional ES/9000* bipolar systems can
be deployed. The basic processor design has a long history of
fault-tolerant features. [13]
The disks are fully connected to
all processors. The I/O architecture has many
advanced reliability and performance features (e.g., multiple paths
with automatic reconfiguration for availability). The basic
I/O architecture is described in
Reference 14 and
one aspect of the dynamic I/O configuration is
described in Reference 15.
The cluster is organized in this fashion to increase the number of
processors that can be applied effectively to large business problems,
on-line transaction processing, extensive queries, and applications on
different systems that need to concurrently access and update a single
database. For example, a cluster with three ten-way
SMP nodes can utilize 30 processors to work on a
problem, with effective performance increasing nearly linearly with the
number of processing nodes.
[10,16]
On the other hand, if an
attempt is made to include more than ten processors in an
SMP, incremental effective capacity diminishes
rapidly. This is due to increasing interprocessor communication to
provide interlocked-update access to memory, processor cache
invalidation, and operating system overhead to manage processor
resources.
Base OS/390 cluster services. A set of operating system
services is provided as building blocks for construction and
management of multisystem applications, subsystems, and components.
These are described in detail later; here we only briefly cover some of
the most relevant aspects.
First, a set of cluster group membership services is provided. These
allow processes to join or leave multisystem logical groups,
communicate with other group members, and be notified of events related
to the group.
Second, efficient, shared access to operating system resource state
data is provided. These state data are located on
coupling data sets and many advanced functions are provided,
including serialized access to the data (with special time-out logic to
handle faulty processor nodes) and duplexing of the disks containing
the state data. In addition, there are availability enhancements for
planned and unplanned changes to the coupling data sets (e.g.,
"hot-switching" of the duplexed disks).
Third, processor "heartbeat" monitoring is provided. In addition to
standard monitoring of the health of each node, functions are also
provided to automatically terminate a failing node and disconnect the
node from its externally attached devices. This enables other
multisystem components to be designed with a "fail-stop" strategy
(performing peer recovery for a failing node with assurance that the
faulty processor does not suddenly resume processing and interfere with
recovery of shared resources). This system isolation function is called
system fencing and is exploited by OS/390 as
part of sysplex partitioning actions. Sysplex partitioning
is the term used to describe the set of actions peer systems take to
remove another system node from the cluster, including physical
isolation, freeing of shared resources, and cleanup of state
information related to the system being removed. More information is
provided in the section on system fencing.
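The relationship between heartbeat monitoring and the fail-stop strategy can be sketched as follows. This is an illustrative model only: the `HeartbeatMonitor` class, its method names, and the timeout value are assumptions, and the actual fencing mechanism operates on hardware rather than on an in-memory set.

```python
class HeartbeatMonitor:
    """Detects nodes whose status updates have stopped arriving."""

    def __init__(self, timeout):
        self.timeout = timeout   # hypothetical failure-detection interval
        self.last_seen = {}      # node -> time of last heartbeat
        self.fenced = set()      # nodes isolated from shared devices

    def heartbeat(self, node, now):
        # A fenced node is never readmitted by a late heartbeat (fail-stop).
        if node not in self.fenced:
            self.last_seen[node] = now

    def check(self, now):
        """Fence every node that missed its interval; return those nodes."""
        failed = [n for n, t in self.last_seen.items()
                  if now - t > self.timeout]
        for node in failed:
            self.fenced.add(node)      # isolate the node first ...
            del self.last_seen[node]   # ... then peers may recover resources
        return failed


monitor = HeartbeatMonitor(timeout=5)
monitor.heartbeat("SYSA", now=0)
monitor.heartbeat("SYSB", now=0)
monitor.heartbeat("SYSA", now=4)
failed = monitor.check(now=7)
print(failed)
monitor.heartbeat("SYSB", now=8)   # ignored: SYSB has been fenced
print(sorted(monitor.last_seen))
```

Isolating the node before peer recovery begins is what lets the recovering peers assume that the failed node will not resume and touch shared resources partway through recovery.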
Although the use of multiple interconnected microprocessors can
aggregate large amounts of processing power, low cost can only be
achieved if the processors are efficiently utilized. Therefore, the
ability to dynamically and automatically manage system resources
is a key objective. A new component, the workload
manager, [17]
was designed to meet this objective.
A multisystem automatic restart manager (ARM)
facility is provided as a base operating system cluster component. The
ARM component is fully integrated in the Parallel
Sysplex structure and provides significantly more functions than a
traditional "restart" service. First, it utilizes the shared-state
support previously described so that at any given point in time the
ARM is aware of the state of processes on all
systems (i.e., even of processes that "exist" on failed nodes).
Second, the ARM is tied into the processor heartbeat
functions so that it is immediately made aware of node failures. Third,
the ARM is integrated with the workload manager so
that it can provide a target restart system based on the current
resource utilization across the available nodes. Finally, the
ARM contains many features to provide improved
restarts such as affinity of related processes, restart sequencing, and
recovery when subsequent failures occur. These services are described
more fully in Reference 3.
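As a rough sketch of how these features combine, the function below picks a restart target from current utilization and orders the affected elements for restart. It is not the ARM interface: the function name, the element dictionaries, and the `level` sequencing field are hypothetical.

```python
def plan_restart(elements, utilization, failed_node):
    """Return (target, ordered element names) for a failed node.

    elements: list of dicts with hypothetical keys "name", "node", "level".
    utilization: node -> fraction of processor capacity in use.
    """
    survivors = {n: u for n, u in utilization.items() if n != failed_node}
    if not survivors:
        return None, []
    # Target selection uses current resource utilization.
    target = min(survivors, key=survivors.get)
    affected = [e for e in elements if e["node"] == failed_node]
    # Related elements go to the same target, lower levels first.
    ordered = sorted(affected, key=lambda e: e["level"])
    return target, [e["name"] for e in ordered]


elements = [
    {"name": "CICS1", "node": "SYSA", "level": 2},
    {"name": "DB2A", "node": "SYSA", "level": 1},
    {"name": "IMS1", "node": "SYSB", "level": 1},
]
utilization = {"SYSA": 0.0, "SYSB": 0.6, "SYSC": 0.3}
target, order = plan_restart(elements, utilization, failed_node="SYSA")
print(target, order)
```

The element list can be consulted after the failure only because process state is kept in shared storage rather than on the failed node itself.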
Coupling facility. At the heart of the Parallel Sysplex
coupling technology is the coupling facility (CF), a
new component providing hardware assists for a rich and diverse set of
multisystem data-sharing functions. The coupling facility architecture
provides three behavioral models to enable efficient clustering
protocols:
- Lock model: supports high-performance, fine-grained global locking
and contention detection
- Cache model: provides global coherency controls for distributed local
processor caches and a high-performance shared data cache
- Queue (list) model: provides a rich set of queuing constructs in
support of workload distribution, message passing, and sharing of state
information.
Physically, the CF consists of hardware and
specialized microcode (control code) that implements the
S/390 Parallel Sysplex architecture extensions. The
CF control code runs on the latest generations of
S/390 processors. CFs are
attached to other S/390 processors running the
OS/390 or MVS operating system
via high-speed coupling links. The coupling links use
specialized protocols for highly optimized transport of commands and
responses to and from the CF. The coupling links are
fiber-optic channels providing 100 megabyte per second data transfer
rates. Commands to the CF can be executed
synchronously or asynchronously to further CPU
instruction processing, with CPU-synchronous command
completion times measured in microseconds, thereby avoiding the
asynchronous execution overheads associated with task switching and
processor cache disruptions. Multiple CFs can be
connected for availability, performance, and capacity reasons.
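The trade-off between the two execution modes can be expressed as a simple cost comparison. The function and the numbers below are illustrative assumptions, not measurements: the model charges the processor the full command service time in the synchronous case, and a fixed suspend/resume overhead (task switch plus cache disruption) in the asynchronous case.

```python
def cheaper_mode(service_time_us, switch_overhead_us):
    """Compare processor time consumed by the two execution modes.

    Synchronous: the processor waits for the whole command.
    Asynchronous: the processor pays a suspend/resume overhead instead.
    Both inputs are illustrative values in microseconds.
    """
    sync_cost = service_time_us
    async_cost = switch_overhead_us
    return "synchronous" if sync_cost <= async_cost else "asynchronous"


print(cheaper_mode(service_time_us=20, switch_overhead_us=50))
print(cheaper_mode(service_time_us=400, switch_overhead_us=50))
```

Under this model, synchronous execution wins whenever the command completes faster than a task switch would cost, which is the situation the low-latency coupling links are designed to create.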
Logically, the CF storage resources can be
dynamically partitioned and allocated into CF
structures, subscribing to one of the three defined
behavioral models: lock, cache, and queue models. Specific commands
are supported by each model and, while allocated,
CF structure resources can only be manipulated by
commands for that structure type as specified at initial structure
allocation. Multiple CF structures of the same or
different types can exist concurrently in the same coupling facility.
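A toy model can make the partitioning rule concrete. In the sketch below, the class name, the command names in each set, and the storage units are invented for illustration; only the rule that a structure accepts commands for its own type comes from the text.

```python
class CouplingFacilityModel:
    """Toy model of storage partitioned into typed, named structures."""

    # Hypothetical command names for each behavioral model.
    COMMANDS = {
        "lock": {"obtain", "release"},
        "cache": {"register", "read", "write"},
        "list": {"enqueue", "dequeue"},
    }

    def __init__(self, storage):
        self.free = storage
        self.structures = {}  # name -> (type, size)

    def allocate(self, name, kind, size):
        if kind not in self.COMMANDS:
            raise ValueError(f"unknown structure type: {kind}")
        if size > self.free:
            raise MemoryError("insufficient facility storage")
        self.free -= size
        self.structures[name] = (kind, size)  # type is fixed at allocation

    def accepts(self, name, command):
        """True if the command belongs to the structure's model."""
        kind, _ = self.structures[name]
        return command in self.COMMANDS[kind]


cf = CouplingFacilityModel(storage=100)
cf.allocate("LOCK1", "lock", 30)
cf.allocate("CACHE1", "cache", 50)   # different types coexist
print(cf.accepts("LOCK1", "obtain"))
print(cf.accepts("LOCK1", "enqueue"))
print(cf.free)
```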
CF connection processing. A CF structure is
allocated when the first attempt is made by a program to connect to
that structure by name (see Figure 2).
CF allocation commands (Step 1 in Figure 2) are
provided to specify the type of structure to allocate, the amount of
storage to assign, and optional structure attributes that depend on the
intended usage of the program. The location of the structure (given
multiple coupling facilities to choose from) and its size are
determined by OS/390 based on customer-supplied
coupling facility resource management policy information. As part of
the connection request, the operating system creates a local state
vector via the DEFINE VECTOR instruction
(Step 2), if warranted. Local state vectors are described in the next
section on coupling support facility.
Figure 2
The vector token returned by DEFINE VECTOR, which
serves as an identifier for the vector, is passed to the
CF in an attach command (Step 3). The command
establishes a binding between the program and the CF
structure; the token is subsequently used by the CF
to deliver secondary commands (not shown in the figure) targeting the
vector during execution of specific other CF
commands. At the completion of the allocation and attach processes, the
operating system records information concerning the structure and user
status in a function data set (Step 4), returns structure attributes to
the requesting program (Step 5A), and informs the program about all
current peer programs connected to the CF structure
(Step 5B). Other connectors are similarly informed about the presence
of the new connector (Step 6). Two of the notifications are presented
by OS/390 to user program event exits, which were
specified on the OS/390 connection service
interface, and which are used to inform programs about any subsequent
status changes (Step 7) related to the CF structure.
The structure persists as long as there are connectors to it, and can
optionally persist even in the absence of any attached program users.
Related services and CF commands are provided for
disconnect and structure deallocation.
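The connection flow above can be summarized in a short model. This is a sketch of the sequence only, not the OS/390 connection service: the `StructureRegistry` class and its fields are hypothetical, a counter stands in for the token returned by DEFINE VECTOR, and the recording of status in a function data set (Step 4) is omitted.

```python
import itertools


class StructureRegistry:
    """Models allocate-on-first-connect and peer notification."""

    def __init__(self):
        self.structures = {}   # name -> {"type": ..., "connectors": {...}}
        self._tokens = itertools.count(1)
        self.events = []       # (recipient, event, subject)

    def connect(self, program, name, kind):
        created = name not in self.structures
        if created:
            # Step 1: the first connect by name allocates the structure.
            self.structures[name] = {"type": kind, "connectors": {}}
        structure = self.structures[name]
        token = next(self._tokens)                  # Step 2: vector token
        peers = list(structure["connectors"])
        structure["connectors"][program] = token    # Step 3: attach
        for peer in peers:                          # Step 6: inform peers
            self.events.append((peer, "new-connector", program))
        # Steps 5A and 5B: attributes and current peers go to the caller.
        return {"type": structure["type"], "token": token,
                "peers": peers, "allocated": created}


registry = StructureRegistry()
first = registry.connect("DB2A", "GBP0", "cache")
second = registry.connect("DB2B", "GBP0", "cache")
print(first["allocated"], second["allocated"])
print(second["peers"])
print(registry.events)
```

The second connector does not allocate anything: it attaches to the existing structure and learns about its peer, while the peer is told about the newcomer.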
General CF characteristics. In Part II on architecture, the
CF models will be discussed in some detail; however,
it is worthwhile to introduce some general behavioral characteristics
as a frame of reference. The CF supports a number of
key functions to facilitate reliable resource management and
communication with attached system processing nodes. Some of the
functions are:
- Global commands are provided to control
CF resource management and ownership, to ensure that
resource management policies are cohesively administered by the systems
comprising a single sysplex cluster. Assignment of
CF resources by the attached operating system nodes
is predicated on authority-based conditional execution of commands
requesting resource allocation.
- A set of pathing commands is provided that enables each
attached system to establish reliable communications with an attached
CF. Information is exchanged as part of path
validation that uniquely identifies the CF and each
processing node so that reliable pathing configuration tables can be
constructed and reverified across link failures. These mechanisms
ensure that commands directed from attached systems to a
CF or vice versa (such as cross-invalidate commands)
are not inadvertently executed on the wrong target processor due to
miscabling of physical links.
- Specialized hardware and operating system software protocols
are supported to guarantee the integrity of command delivery, even in
the presence of link failures, without introducing sympathy
sickness across nodes in the cluster.
Through these link recovery mechanisms, for example, a
write command to the coupling facility initiated by a program on one
node of the cluster does not have to fail, even if the resultant
cross-invalidate signal cannot be delivered to another target node
caching a down-level version of the data item. The target system node
is guaranteed to observe the fact that its link to the
CF was impaired prior to reliance on the integrity
of its local state vectors. Upon detection of such a failure, the
affected operating system takes recovery actions to cause data-sharing
programs, on that node only, to reregister their interest in shared
resources with the CF. This is accomplished by
over-indicating the invalid state of local cache vectors (or the
nonempty state of list-notification vectors) when loss of connectivity
is detected.
- Commands to the CF are executed
atomically, i.e., they are completed in their entirety or they are
backed out at the CF in the event of failure. They
never complete with partial results being stored. This greatly
simplifies the recovery logic for systems attached to the
CF.
- Further, this behavior extends to the execution of
concurrent commands in parallel at the CF. Partial
results of a command execution are not observable to other
commands while that command is still in progress. These atomicity
properties enable programs connected to the CF to
rely on the implicit serialization of command execution. This
eliminates the need for programs to obtain explicit multisystem
software serialization in order to execute a single command, such as
inserting a work element onto a shared queue.
- While the ensuing discussions focus on one or more systems
connected to a single CF, it is generally
anticipated that two CFs will be configured to
provide redundancy. OS/390 provides a recovery
service to exploiting programs to coordinate the repopulation of the
contents of a CF structure into an alternate
CF, for either failure or planned reconfiguration.
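The atomicity property in the list above can be illustrated with a small sketch. This is a toy model in Python, not the CF command set: the class `ToyStructure`, its `execute` method, and the copy-then-publish strategy are assumptions made purely for illustration. It shows a command that either completes in its entirety or leaves no partial results behind.

```python
import copy


class ToyStructure:
    """Hypothetical stand-in for a CF structure; not a real CF interface."""

    def __init__(self):
        self.queues = {"work": []}

    def execute(self, command):
        # Run the command against a private copy, then publish in one step,
        # so other commands never observe partial results.
        scratch = copy.deepcopy(self.queues)
        try:
            command(scratch)
        except Exception:
            return False          # backed out: nothing was stored
        self.queues = scratch     # completed in its entirety
        return True


def good_insert(queues):
    queues["work"].append("element-1")


def failing_insert(queues):
    queues["work"].append("element-2")
    raise RuntimeError("simulated failure midway through the command")


cf = ToyStructure()
ok = cf.execute(good_insert)
failed = cf.execute(failing_insert)
print(ok, failed, cf.queues)
```

Because the failing command never publishes its scratch copy, the shared queue holds only the element from the successful command, which is the behavior that spares attached systems from writing recovery logic for half-completed commands.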
Coupling support facility. Specialized hardware provided on
each processing node in the Parallel Sysplex cluster is responsible for
controlling communication between the processor and the
CF. This specialized hardware is called a
coupling support facility, as depicted in Figure
3. The coupling support facility consists of new
S/390 CPU instructions, high-speed links, and link
microprocessors. It also utilizes processor memory to contain
local state vectors. These vectors are used to locally track
the state of resources maintained in the CF. As will
be seen, these local state vectors are key to avoiding unnecessary
communication between the processing node and the CF
to observe critical state information.
Figure 3
The coupling support facility provides several critical functions,
discussed next.
Coupling facility command delivery. The coupling support
facility provides the means by which a program sends commands to the
CF to request that locking, caching, and queuing
actions be performed. The coupling support facility supports
both synchronous and asynchronous modes of command delivery.
Synchronous commands are completed at the end of the
CPU instruction initiating the command, based
on highly optimized, low-latency transport protocols. Asynchronous
commands are completed after the CPU instruction
initiating the command is ended, with the completion notice being sent
to the operating system via a new notification mechanism that avoids
the necessity of raising a processor interruption.
Secondary command execution. The coupling support facility
executes secondary commands that are sent by a CF to
the processing node as part of performing certain command operations at
the CF. With one exception, the secondary commands
direct the coupling support facility to update state information in the
local state vectors to reflect updated resource status at the
CF. A secondary command may, for example, store an
invalid-buffer indication at a processing node to signal that the node
no longer has the latest version of a locally cached data item.
Local state vector control. The coupling support
facility introduces a set of CPU instructions that
interrogate and update local state vectors. A DEFINE
VECTOR instruction dynamically allocates, deallocates, or
changes the size of a local state vector. The vectors are in protected
storage and are only accessible via a
coupling-support-facility-assigned unique token. This ensures that
programs do not inadvertently overlay vectors for which they have no
access authority. Instructions are provided to test and manipulate
bits in the state vectors conveying the state of associated
resources, and are described in the context of their use. There are
three kinds of local state vectors used: (1) Local cache vectors are
used in conjunction with CF cache structures to
track local buffer coherency; (2) list-notification vectors are used
with CF list structures to provide
notification of CF list empty/nonempty state
transitions; and (3) list-notification vectors are also employed by the
coupling support facility to indicate the completion of asynchronous
command operations. Usage scenarios for each of these types of vectors
are described later in sections on cache structures, list structures,
and command delivery.
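As a rough illustration of token-protected local state vectors, the following Python sketch models the define, set, and test operations described above, together with the over-indication performed when connectivity to the CF is lost. All names (`CouplingSupportToy`, `define_vector`, `set_entry`, `test_entry`, `over_indicate`) are hypothetical, and the choice to start every entry in the invalid state is an assumption of the sketch, not a statement about the hardware.

```python
import secrets


class CouplingSupportToy:
    """Toy model of token-protected local state vectors (hypothetical names)."""

    def __init__(self):
        self._vectors = {}   # token -> list of bits; True means "valid"

    def define_vector(self, size):
        # The token is the only handle a program ever receives.
        token = secrets.token_hex(8)
        # Assumption: entries start invalid, so nothing is trusted until
        # interest has been registered with the CF.
        self._vectors[token] = [False] * size
        return token

    def set_entry(self, token, index, valid):
        self._vectors[token][index] = valid

    def test_entry(self, token, index):
        # A purely local storage reference; no CF communication is modeled.
        return self._vectors[token][index]

    def over_indicate(self, token):
        # On loss of connectivity, mark every entry invalid so the program
        # must reregister interest before trusting any local copy.
        vector = self._vectors[token]
        for i in range(len(vector)):
            vector[i] = False


csf = CouplingSupportToy()
tok = csf.define_vector(4)
csf.set_entry(tok, 2, True)
before = csf.test_entry(tok, 2)
csf.over_indicate(tok)
after = csf.test_entry(tok, 2)
```

A program that presents a token it was never given fails the lookup in this model, which loosely mirrors the protection against overlaying vectors for which a program has no access authority.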
Hardware-assisted system isolation. The coupling support
facility also provides a system fencing function that
isolates a failing system node from being able to access shared
external resources during cluster fail-over recovery scenarios. This
capability is discussed further later in this paper.
A detailed architectural review of the coupling support facility
functions is presented later in this paper.
PART II
Coupling facility architecture
This section introduces three types of CF
storage structures that are used to enable high-performance, highly
scalable, read/write data sharing across a Parallel Sysplex cluster. We
discuss the features of CF lock, cache, and list
structures and outline the software-controlled caching
protocols that are implemented using CF cache
structures.
Lock structures. The CF lock model supports
high-performance, finely grained lock resource management, maximizing
concurrency and minimizing communication overhead associated with
multisystem serialization protocols. This model enables a specialized
lock manager (e.g., a database lock manager) to be extended into a
multisystem environment.
The CF lock structure provides a hardware-assisted
global lock contention detection mechanism for use by distributed lock
managers, such as the IMS Resource Lock Manager. The
lock structure supports a program-specifiable number of lock table
entries that are used to record shared or exclusive interest in
software locks that map via software hashing to a given
CF lock table entry (see Figure 4).
Interest in each lock table entry is tracked for all peers connected to
the CF structure across the systems in the sysplex.
Each entry has a global byte to contain the system identifier of the
first system to register exclusive interest in any of the lock resource
names that hash to that lock table entry, and a share bit string that
identifies, by position, systems that have share interest in that hash
class.
Figure 4
OS/390 provides locking services to obtain, release,
and modify lock ownership state information for program-specified lock
requests. To request lock ownership, a program passes the software lock
resource name, the hash class value (to use as the index to the
coupling facility lock table entry), the shared or exclusive interest,
user data (used to negotiate protocol-specific hierarchical lock
ownership states), and program-specified lock information (recorded in
the entry for use in recovery processing). If the system does not
already have a registered compatible interest in the specified lock
table entry, OS/390 will issue a command to the
CF to perform the registration.
Through use of efficient hashing algorithms and granular serialization
scope, false lock resource contention is kept to a minimum. This
allows the majority of requests for locks to be granted
synchronously (CPU-instruction-synchronously) to the
requesting system, where synchronous execution times are measured
in microseconds. Only in exception cases involving lock contention is
lock negotiation required, wherein the CF returns
the identity of the systems currently holding locks in an incompatible
state with the current request to enable selective cross-system
communication for lock negotiation.
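The contention-detection behavior described above can be sketched as follows. This is a simplified Python model under stated assumptions: `LockTableEntry`, `ToyLockStructure`, and `request` are hypothetical names, the CRC-based hash stands in for whatever software hashing a lock manager actually uses, and interest is recorded only when the request is compatible.

```python
from dataclasses import dataclass, field
from typing import Optional, Set
import zlib


@dataclass
class LockTableEntry:
    exclusive_owner: Optional[int] = None            # the "global byte"
    sharers: Set[int] = field(default_factory=set)   # the "share bit string"


class ToyLockStructure:
    def __init__(self, entries):
        self.table = [LockTableEntry() for _ in range(entries)]

    def hash_class(self, resource_name):
        # Software hashing of a lock resource name to a lock table entry.
        return zlib.crc32(resource_name.encode()) % len(self.table)

    def request(self, system, hash_class, exclusive):
        """Return the systems holding incompatible interest (empty = granted)."""
        entry = self.table[hash_class]
        holders = set()
        if entry.exclusive_owner is not None and entry.exclusive_owner != system:
            holders.add(entry.exclusive_owner)
        if exclusive:
            holders |= entry.sharers - {system}
        if holders:
            return holders        # contention: negotiate with these systems only
        if exclusive:
            entry.exclusive_owner = system
        else:
            entry.sharers.add(system)
        return set()


locks = ToyLockStructure(entries=8)
first = locks.request(system=1, hash_class=3, exclusive=False)
second = locks.request(system=2, hash_class=3, exclusive=False)
third = locks.request(system=3, hash_class=3, exclusive=True)
print(first, second, third)
```

The third request is reported as contended even if it names a different lock resource that merely hashes to the same entry. That is the false contention the text mentions, and it is why the returned system identities are used for selective negotiation rather than treated as a definitive conflict.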
OS/390 provides a rich set of cross-system lock
management services to coordinate lock contention negotiation and
resolution, lock request suspension and completion, and recording of
persistent lock information in the CF. In the event
of system or lock manager failure, other systems can interrogate the
recorded recovery information for the failing system to quickly
determine the set of locks held at the time of failure, enabling
efficient lock recovery. The CF lock structure and
supported protocols are discussed in detail in
Reference 18.
Cache structures. A CF cache structure
provides the functions needed for multisystem shared-data cache
coherency management. The purpose of this model is to enable an
existing buffer manager (e.g., a database buffer manager) to be
extended into a clustered system environment. It permits each system
node to locally cache shared data in processor memory with full
data integrity and optimal performance. Additionally, data can be
optionally cached globally in the CF cache structure
for high-speed local-buffer refresh. As a global shared cache, the
CF can be viewed as a second-level cache between
local-processor memory and shared disk in the storage hierarchy.
A CF cache structure contains a global buffer
directory that tracks multisystem interest in shared-data items cached
in one or more system local buffer pools. A separate directory entry is
maintained in the CF structure for each uniquely
named data item. A directory entry is created the first time a command
requests registration of interest in the data item or a write operation
is performed to place the data item into the shared cache. The
directory entry contains control information for the data item used in
execution of CF commands targeting that entry. For
example, the directory entry contains the program-provided unique name
of the data item (which serves as the means for finding the directory
entry via internal hash on the name on cache structure commands). Also,
the directory entry contains a user registry identifying each system
that has a valid registered interest in that data item, along with the
local cache vector index being used to track the interest each database
manager has in the data item cached in its local buffer pool. The
directory entry contains an internal pointer to the
CF-cached version of the data item if present, as
well as a bit indicating whether the data item is cached in a changed
or unchanged state with respect to the permanently stored version of
the data item on shared disk (see Figure 5).
Figure 5
The CF cache structure architecture was designed to
support three basic caching protocols:
- Directory-only cache. A directory-only cache utilizes the global
buffer coherency tracking mechanisms provided by the
CF, but does not store data in the cache structure.
This allows read/write sharing of data with local buffer coherency, but
refresh of down-level local copies of data items is via access to the
shared disk containing the data item, and all updates are written
permanently to disk as part of the write operation.
- Store-through cache. When used as a store-through cache, in addition to
the global buffer coherency tracking, updated data items are written to
the cache structure as well as to shared disk. The directory entries
for these data items are marked as unchanged, since the version of the
data in the CF matches the version hardened on disk.
This enables rapid buffer refresh of down-level local buffer copies
from the global CF cache, avoiding
I/Os to the shared disk.
- Store-in cache. When used as a store-in cache, the database manager
writes updated data items to the CF cache structure
synchronous to the commit of the updates. This protocol has additional
performance advantages over the previous protocols as it enables fast
commit of write operations. However, here the data are written to the
cache structure as changed with respect to the disk version of the
data. The database manager is responsible for casting out
changed data items from the global cache to shared disk as part of a
periodic scrubbing operation to free up global cache resources for
reclaim. Further, an additional recovery burden is placed on the
database manager to recover changed data items from logs in the event
of a CF structure failure.
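The differences among the three protocols can be summarized in a few lines of Python. The enumeration and the `write_plan` helper are illustrative names only; the values simply restate the list above.

```python
from enum import Enum


class Protocol(Enum):
    DIRECTORY_ONLY = "directory-only"
    STORE_THROUGH = "store-through"
    STORE_IN = "store-in"


def write_plan(protocol):
    """Return (data stored in CF?, change bit, disk write as part of the update?)."""
    if protocol is Protocol.DIRECTORY_ONLY:
        return (False, None, True)           # coherency tracking only
    if protocol is Protocol.STORE_THROUGH:
        return (True, "unchanged", True)     # CF copy matches the disk copy
    return (True, "changed", False)          # store-in: castout happens later


for p in Protocol:
    print(p.value, write_plan(p))
```

Only the store-in case leaves the CF holding the sole current copy of the data, which is why it alone carries the castout and log-recovery obligations noted above.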
Reclaim processing. The CF cache
architecture provides commands and processes to efficiently manage
shared cache directory and data resources. Each directory entry in the
CF cache (and related data when present) is
associated with a program-specified storage class when the directory
entry is created. When read or write command references are made to a
named data item being tracked or cached in the CF,
that entry is marked by the CF as being the most
recently referenced entry for the storage class. Directory entries are
maintained in the storage class in least-recently-used
(LRU) order for purposes of reclaiming unchanged
directory and data resources from the cache to satisfy new resource
requests. Multiple storage classes in the CF cache
allow programs to group data sets being cached according to performance
class priority, and commands are provided to direct
CF resource reclaim algorithms in accordance with
the priorities established for the storage classes.
CF directory and data reclamation for unchanged data
items is performed automatically by the CF in
response to demand. If it is necessary to reclaim an aged directory
entry to satisfy a new request and there is registered interest being
actively tracked for one or more connected programs in the targeted
entry, cross-invalidate signals are directed to the local cache vectors
for those programs to reflect the fact that their interest is no longer
being tracked. Note that the CF does not perform
reclaim processing for changed data items in the cache structure.
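A minimal sketch of reclaim within one storage class follows, assuming a fixed directory capacity and using Python's `OrderedDict` to keep entries in least-recently-used order. `ToyStorageClass`, `reference`, and the tuple form of the cross-invalidate signals are hypothetical; the real reclaim algorithms and their priority controls are not modeled.

```python
from collections import OrderedDict


class ToyStorageClass:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # name -> {"changed", "registered"}

    def reference(self, name, user, changed=False):
        """Reference an item; return cross-invalidate signals caused by reclaim."""
        signals = []
        if name not in self.entries:
            if len(self.entries) >= self.capacity:
                signals = self._reclaim()
            self.entries[name] = {"changed": False, "registered": set()}
        entry = self.entries[name]
        entry["changed"] = entry["changed"] or changed
        entry["registered"].add(user)
        self.entries.move_to_end(name)     # now the most recently referenced
        return signals

    def _reclaim(self):
        # Oldest unchanged entry is the victim; changed entries are skipped.
        victim = next(
            (n for n, e in self.entries.items() if not e["changed"]), None)
        if victim is None:
            raise MemoryError("only changed entries remain; castout is needed")
        entry = self.entries.pop(victim)
        return [(user, victim) for user in sorted(entry["registered"])]


sc = ToyStorageClass(capacity=2)
sc.reference("A", "U1", changed=True)
sc.reference("B", "U2")
signals = sc.reference("C", "U1")
print(signals, list(sc.entries))
```

Although "A" is the oldest entry, it is changed and therefore skipped; "B" is reclaimed instead, and the user registered for "B" receives a cross-invalidate signal because its interest is no longer tracked.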
Castout processing for changed data items. To facilitate use
as a store-in cache, the CF mechanisms allow
efficient retrieval of changed data items from the cache so that they
can be written to disk, rendering them unchanged and available for
subsequent reclaim. The directory entry contains a castout class field
used to group changed data items together on common castout class
queues (program-specified) so that physically coresident data items can
be retrieved and written to the same disk volume in a single
I/O operation. Refer again to Figure 5.
Further, each directory entry contains a castout lock that prevents
multiple program processes from casting out the same data item to disk
concurrently. Failure to provide this mechanism could result in
interleaved write updates being cast out to disk out of sequence with
respect to the order in which the updates were made to the
CF cache entry. The castout lock is set during
execution of a read-for-castout command that marks the data item
unchanged and returns the data to the program. Note that the data item
is not eligible for CF reclaim while the
castout lock is held. When the program completes the disk
I/O, it issues an unlock-castout-lock command to
cause the CF to release the castout lock, rendering
the data item eligible for reclaim.
However, it is desirable to allow new write operations to continue to
make updates to a CF directory entry concurrent with
castout processing for that entry. Thus, the architecture enables
writes to an entry to store updated data while the castout lock is held
by another program process. Data integrity is preserved by setting the
change bit for the entry on again, which will persist when the castout
process releases the castout lock (i.e., the data item will not be
eligible for reclaim when the castout lock is released).
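The interplay between the castout lock and the change bit can be sketched as a small state machine. The class and method names below are hypothetical and the model covers a single directory entry only; it is meant to show why a write that arrives during castout keeps the entry ineligible for reclaim.

```python
class ToyDirectoryEntry:
    def __init__(self):
        self.data = None
        self.changed = False
        self.castout_locked = False

    def write(self, data):
        # Store-in write: permitted even while the castout lock is held.
        self.data = data
        self.changed = True

    def read_for_castout(self):
        if self.castout_locked:
            raise RuntimeError("castout already in progress for this entry")
        self.castout_locked = True
        self.changed = False       # marked unchanged as the data is returned
        return self.data

    def unlock_castout(self):
        self.castout_locked = False

    def reclaimable(self):
        return not self.changed and not self.castout_locked


entry = ToyDirectoryEntry()
entry.write("v1")
snapshot = entry.read_for_castout()     # "v1" is now being written to disk
during = entry.reclaimable()
entry.write("v2")                       # concurrent update sets the bit again
entry.unlock_castout()
after_first = entry.reclaimable()

second_snapshot = entry.read_for_castout()
entry.unlock_castout()
after_second = entry.reclaimable()
```

After the first castout the entry still holds data newer than the disk copy, so it remains ineligible for reclaim. Only a second castout that picks up "v2" makes it reclaimable.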
Reference 19 contains
greater detail about these processes and how they
relate to exploitation of a coupling facility by IBM's
DB2.
Read scenario. In order to describe how the
CF supports protocols enabling distributed local
caches to maintain coherency with respect to one another, it is best to
walk through two scenarios. First a read scenario is discussed,
followed by a write scenario.
Refer to Figure 6 for the following discussion. When a
database manager, such as IBM's DB2, first connects
to a CF cache structure via
OS/390 system services, the operating system
allocates a local cache vector in protected processor storage on behalf
of the database manager. The local cache vector is used to track the
coherency of data cached in the local buffer pool.
OS/390 passes the local cache vector token to the
CF as part of attaching the program user
(DB2) to the cache structure, as previously
described in the section "coupling facility." The database manager
associates each buffer in the buffer pool with a unique bit position in
the local cache vector. When the database manager receives a request
for access to a data item (named "A" in this scenario), it acquires
a lock on the data. The lock may be a global lock obtained through
access to a CF lock structure, for example. Next,
the program attempts to locate "A" in the local buffer
pool at Step 1. If "A" is located, then the currency of the locally
cached copy of "A" needs to be determined. This is accomplished
using a TEST VECTOR ENTRIES instruction in
Step 2, passing the vector token and the local cache vector index for
that local buffer as input to the instruction. The TEST
VECTOR ENTRIES interrogates the vector in protected
processor storage and sets a condition code indicating whether the
local copy of "A" is valid or invalid (down level). Note that this
check is a processor storage reference and involves no communication
with the CF. If the locally cached copy of "A"
is valid, it is returned to the requestor from the local buffer pool
and the lock on "A" is released.
Figure 6
If "A" is not in the local buffer pool or the cached copy was
invalid, the program assigns a buffer in the pool to contain the data
item. Then, at Step 3, the program issues a read-and-register command
to the CF to register its interest in that data item
with the CF, passing the program-specified data item
name and the local cache vector index associated with the local buffer
where the data item is being cached. In addition, the program can
provide the name of the old data item that was cached in the assigned
buffer before it was reassigned to contain "A" as input to the
command, for example "B." OS/390, as part of
passing the command to the CF, first sets the
specified local cache vector bit optimistically to the valid state via
a SET VECTOR ENTRY instruction. Upon receipt of the
read-and-register command, the CF finds or assigns a
directory entry for data item "A" and updates the user registry for
the requesting connected program user (U1), saving
the local cache vector index and marking the user as having a
registered interest in "A." If the data for "A" are present in
the CF cache from a prior write operation, the data
are returned to the program and stored in the local buffer pool as part
of the command execution. Also, if the "old" named data item
"B" has a current directory entry present in the cache structure
and it still reflects U1 as being validly registered
for that data item with the same local cache vector index being
tracked, then U1's interest in "B" is
deregistered, as the local cache vector index is now being used to
track interest in "A." If the read-and-register command fails for
any reason, the operating system issues a SET VECTOR
ENTRY instruction to reset the target local cache vector bit
to the invalid state.
If the CF did not have a copy of the data in its
cache, then the program issues an I/O to retrieve
the data item from disk at Step 4. If the program desires to place the
unchanged data item into the CF cache so that it may
be fetched subsequently for rapid buffer refresh when a local read miss
occurs, a write-when-registered or write-and-register command is issued
to store the data item at the CF in Step 5. At this
point the data item can be returned to the requesting program and the
lock on "A" released.
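The read scenario can be approximated in Python as follows. This is a sketch under several assumptions: `ToyCache`, `read_item`, the dictionary-based buffer pool, and the "always reuse buffer 0" replacement policy are all hypothetical, locking is omitted, and the optional Step 5 write of unchanged data to the CF is left out.

```python
class ToyCache:
    """Toy CF cache: name -> {"data": ..., "registry": {user: vector index}}."""

    def __init__(self):
        self.directory = {}

    def read_and_register(self, user, name, index, old_name=None):
        entry = self.directory.setdefault(name, {"data": None, "registry": {}})
        entry["registry"][user] = index
        if old_name is not None and old_name != name:
            old = self.directory.get(old_name)
            if old is not None and old["registry"].get(user) == index:
                del old["registry"][user]   # index now tracks `name` instead
        return entry["data"]


def read_item(cf, user, vector, pool, disk, name):
    """pool maps vector index -> (name, data); vector[i] True means valid."""
    index = next((i for i, (cached, _) in pool.items() if cached == name), None)
    if index is not None and vector[index]:
        return pool[index][1], "local"      # Steps 1-2: no CF communication
    if index is None:
        index = 0                           # hypothetical replacement policy
    old_name = None
    if index in pool and pool[index][0] != name:
        old_name = pool[index][0]
    vector[index] = True                    # set optimistically (Step 3)
    try:
        data = cf.read_and_register(user, name, index, old_name)
    except Exception:
        vector[index] = False               # reset if the command fails
        raise
    source = "cf"
    if data is None:
        data, source = disk[name], "disk"   # Step 4: CF had no copy
    pool[index] = (name, data)
    return data, source


cf = ToyCache()
disk = {"A": "a-from-disk", "B": "b-from-disk"}
vector, pool = [False], {}
first = read_item(cf, "U1", vector, pool, disk, "B")
second = read_item(cf, "U1", vector, pool, disk, "B")
third = read_item(cf, "U1", vector, pool, disk, "A")
```

The second read is satisfied from the local buffer after a local vector test. The third read reuses the buffer for "A" and, by naming "B" as the old item, causes interest in "B" to be deregistered.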
Write scenario. Refer to Figure 7 for the
following discussion. Assume here that a request is made to the
database manager to update data item "A." As before, the database
manager locks and locates "A" in its local buffer pool and tests
the validity of the locally cached copy. The program uses the local
copy if current or retrieves a current copy if not, as described in the
previous section. Then, at Step 1, the program updates the local copy
of data item "A."
Figure 7
At Step 2, if it desires to store the updated data in the
CF, the program issues a write-when-registered
(WWR) or write-and-register (WAR)
command to the CF, passing the data and the local
cache vector index being used to track interest. If the program intends
to write the data to disk as part of a store-through caching protocol,
then an indication is specified on the write command to set the change
bit as "unchanged" in the directory entry for "A." If the
protocol is to use the CF cache as a store-in cache,
then the change bit is set to the "changed" state.
The difference between the WWR and
WAR commands is that the WAR
command will allocate a directory entry for "A" if one is not
present and will unconditionally over-write the existing data for
"A" if already present in the CF (on the
presumption that the program holds an exclusive lock on the data item
"A" and knows it has a current copy). The WWR
command conditionally performs the write operation only if the writer
is currently registered at the CF as having a valid
local copy of "A." This capability is important for programs
holding a lock on a specific record within data item "A," but not a
lock on the entire data item. The validity check at the
CF entry on the WWR command
ensures that concurrent updates to different records associated with
the same data item cached in the CF cannot result in
one system writing a down-level version of the data item into the
CF. Without this validity check, a program could
test its local cache vector bit, find it valid, and then proceed to
update the local copy, missing a cross-invalidate signal that was
issued on behalf of another update to a different record just
after the test of the local vector bit.
Alternatively, if the CF cache is being used solely
to provide cluster-wide buffer coherency tracking as part of a
directory-only caching protocol, an invalidate-complement-copies
(ICC) command is issued to the CF
at Step 2 instead of a write command to cause the cross-invalidate
function to be performed without storing data in the
CF cache for "A."
At Step 3, as part of execution of the WWR,
WAR, or ICC command at the
CF, the user registry for "A" is checked to
determine whether there are any other connected users who have a valid
interest in "A," meaning that they have a locally cached copy of
"A" which still reflects the valid state. If so, the
CF marks those users as invalid in the user registry
and then sends a cross-invalidate command via the coupling links in
parallel to those systems having a registered interest in that data
item. The CF issues the cross-invalidate command,
specifying the local cache vector token and local cache vector index
uniquely identifying the specific vector and bit which is to be
manipulated on the attached processor node. Specialized coupling link
hardware provides processing for buffer invalidation signals sent by
the CF to attached systems. The coupling
support facility link microprocessor receives the cross-invalidate
command and updates the CF-specified bit in the data
manager's local cache vector to indicate the local copy is no longer
valid. This process does not involve any processor interruption or
software involvement on the target system. Work continues without any
disruption. After the CF has observed completion of
all buffer invalidation signals, it responds to the system that
initiated the data update process. Again, this entire process can be
performed synchronously
(CPU-instruction-synchronously) to the updating
system, with completion times measured in microseconds.
At Step 4, if the database manager has written the data item to the
cache structure as unchanged (store-through) or not at all
(directory-only cache protocol), then it will write the data item to
disk at this point. This step is bypassed if the CF
cache is being used as a store-in cache for fast commit of write
updates to avoid incurring disk I/O costs
synchronous to mainline program processing.
At this point, the issuing database manager is free to release its
serialization on the shared-data item.
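The write scenario, including the difference between WWR and WAR and the cross-invalidate step, can be sketched as below. The class `ToyCacheStructure` and its methods are hypothetical. For brevity the local cache vectors are kept inside the toy CF object, whereas in the system described they reside in protected storage on each node and are updated by the coupling link hardware.

```python
class ToyCacheStructure:
    def __init__(self):
        self.directory = {}   # name -> {"data", "changed", "registry"}
        self.vectors = {}     # user -> list of bools (True = valid)

    def attach(self, user, size):
        self.vectors[user] = [False] * size

    def _entry(self, name):
        return self.directory.setdefault(
            name, {"data": None, "changed": False, "registry": {}})

    def register(self, user, name, index):
        self._entry(name)["registry"][user] = index
        self.vectors[user][index] = True

    def _store(self, entry, writer, data, changed):
        entry["data"], entry["changed"] = data, changed
        # Step 3: cross-invalidate every other validly registered user.
        for user, index in list(entry["registry"].items()):
            if user != writer:
                del entry["registry"][user]
                self.vectors[user][index] = False

    def write_when_registered(self, user, name, index, data, changed):
        entry = self.directory.get(name)
        if entry is None or entry["registry"].get(user) != index:
            return False      # writer holds no valid registration: rejected
        self._store(entry, user, data, changed)
        return True

    def write_and_register(self, user, name, index, data, changed):
        self.register(user, name, index)   # allocates the entry if needed
        self._store(self.directory[name], user, data, changed)
        return True


cf = ToyCacheStructure()
cf.attach("U1", 4)
cf.attach("U2", 4)
cf.register("U1", "A", 0)
cf.register("U2", "A", 1)

first = cf.write_when_registered("U1", "A", 0, "A-v2", changed=True)
u2_valid = cf.vectors["U2"][1]
stale = cf.write_when_registered("U2", "A", 1, "A-stale", changed=True)
```

The second WWR is rejected because U2's registration was invalidated by U1's write, so a down-level copy cannot overwrite "A-v2" even if U2 tested its vector bit just before the cross-invalidate arrived.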
Through the cache coherency and global buffer cache management
mechanisms previously described, the
CF and related S/390 Parallel
Sysplex cluster technology provide the basis for high-performance,
scalable read/write data sharing with integrity across multiple
systems, avoiding the message-passing overheads typically associated
with data-sharing parallel systems.
Queue (list) structures. The CF queue or
list structure supports general-purpose multisystem queuing constructs
that are applicable for a wide range of uses, including workload
distribution, intersystem message passing, and maintaining shared
control block state information. As depicted in Figure
8, a list structure includes a program-specified number
of list headers. List structures can support queuing of entries in last
in, first out/first in, first out (LIFO/FIFO) order
or in collating sequence by key under program control.
Individual list entries are dynamically created when first written and
queued to a designated list header. List entries can optionally have a
corresponding data block attached at the time of creation or subsequent
list entry update. Existing entries can be read, updated, deleted, or
moved between list headers atomically, without the need for explicit
software multisystem serialization in order to insert or remove entries
from a list. Compound operations are supported, such as
read-and-delete, write-and-move, etc.
Figure 8
Optionally, the list structure can contain a program-specified number
of lock entries. When so specified, the structure is referred to as a
serialized list structure. In the serialized list structure,
locks are obtained in an exclusive mode only. The individual locks are
solely under software control and do not architecturally map to any
other list objects; however, it is common to map a given lock entry to
a list header (queue) in the list structure. Lock operations include
the ability to obtain ownership of a lock, release the lock, test
whether a specific lock is held, and execute a list command only while
a given lock is not held. A powerful construct of the list model is the
ability to combine a locking operation with a queuing operation to the
list structure in a single compound command, using the success of the
locking operation as a condition for execution of the queuing action. A
common exploitation of the serialized list structure is to request
conditional execution of mainline CF commands as
long as a specified lock is not held. Recovery operations requiring a
static view of a list or the entire structure can set the lock, causing
mainline operations to be rejected. Such a protocol avoids the
necessity for mainline processes to explicitly gain or release the lock
for every request, but still allows such requests to be suspended or
rejected in the presence of long-running recovery operations.
OS/390 supports the ability to either suspend a
serialized list request if the requested lock is not available, or to
conditionally obtain the lock and return control to the program if the
lock is not immediately available.
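The "execute only while the lock is not held" protocol can be sketched as follows. `ToySerializedList` and its methods are hypothetical names, and the single-threaded model only illustrates the conditional check that the CF would perform atomically as part of one compound command.

```python
class ToySerializedList:
    def __init__(self, headers, locks):
        self.lists = [[] for _ in range(headers)]
        self.locks = [None] * locks      # None = not held, else owner name

    def set_lock(self, lock, owner):
        if self.locks[lock] is None:
            self.locks[lock] = owner
            return True
        return False

    def release_lock(self, lock, owner):
        if self.locks[lock] == owner:
            self.locks[lock] = None

    def write_if_not_held(self, lock, header, entry):
        # Mainline request: never acquires the lock, only tests it.
        if self.locks[lock] is not None:
            return False                 # rejected while recovery runs
        self.lists[header].append(entry)
        return True


sl = ToySerializedList(headers=2, locks=2)
before = sl.write_if_not_held(lock=0, header=0, entry="work-1")
sl.set_lock(0, "recovery")
during = sl.write_if_not_held(lock=0, header=0, entry="work-2")
sl.release_lock(0, "recovery")
after = sl.write_if_not_held(lock=0, header=0, entry="work-3")
print(before, during, after, sl.lists[0])
```

Mainline requests pay no lock acquire or release cost in the normal case. They are turned away only while a recovery process holds the lock to obtain a static view of the list.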
There are several mechanisms by which a list entry can be accessed,
depending on structure attributes specified as part of list structure
allocation. Entries can be accessed by a program-provided key, which is
also used to queue the entries collated in keyed sequence on a given
list. Note that multiple entries of the same key can reside on the same
list. Alternatively, list entries can be accessed by a program-assigned
name, which is guaranteed to be unique across the list structure when
the entry is created. List entries can always be accessed in
LIFO/FIFO order from the head or tail of the list.
Further, all list entries are created with a CF-assigned
list-entry identifier (LEID). The
LEID is guaranteed to be unique for the life of the
list structure and provides a direct means of locating an individual
list entry even if it is not otherwise tagged with a key or name.
Each list header in the structure has a set of list controls associated
with it. The controls contain threshold values for the number of list
entries or data elements that can reside on a given list header, so
that a single "runaway" rogue program cannot exhaust all of the list
structure resources. The list controls also
contain a list cursor value, which enables multiple concurrent programs
on different systems to cooperatively browse through a list. Each
program reads the entry adjacent to the last one read by any peer
program, without each system having to communicate with respect to the
current cursor position within the shared list. The list controls
further contain list assignment key controls, whereby the program can
seed the initial and maximum key values so entries created on a list
can be assigned a generated key in sequence by the
CF, without the program having to know the last key
assigned on list entry creation by a peer program on another node.
One of the controls, a list authority value (LAU),
can be set by a program dynamically and used as a comparative operand
on list structure commands directed to the targeted list, causing
commands to be rejected if the comparative check on
LAU fails. This is a useful mechanism to change list
ownership or state with guaranteed failure of any commands issued by
peer programs unaware of the changed ownership or state for that list.
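A compare-based ownership change using the list authority value might look like the sketch below. `ToyListHeader`, `set_authority`, and `write` are hypothetical names, and the integer authority values are arbitrary.

```python
class ToyListHeader:
    def __init__(self, authority=0):
        self.authority = authority
        self.entries = []

    def set_authority(self, expected, new):
        # Compare-and-replace: succeeds only if the caller knows the
        # current authority value.
        if self.authority != expected:
            return False
        self.authority = new
        return True

    def write(self, expected_authority, entry):
        # The command is rejected when the comparative check fails.
        if self.authority != expected_authority:
            return False
        self.entries.append(entry)
        return True


queue = ToyListHeader(authority=1)
old_owner_ok = queue.write(1, "msg-1")
took_over = queue.set_authority(expected=1, new=2)
old_owner_rejected = queue.write(1, "msg-2")   # peer unaware of the change
new_owner_ok = queue.write(2, "msg-3")
print(queue.entries)
```

A peer that still presents the old authority value fails without any explicit notification, which is what makes the change of ownership safe against commands already in flight.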
Other list structure objects can be atomically compared or replaced as
part of list structure command executions to cause conditional
execution only if all comparative checks succeed. In addition to the
LAU check, execution can be conditional based on
successful compare or replace of lock value, list number, or version
number. Every individual list entry supports a version number value
that is initialized and modified by the programs and can serve as a
means of reflecting any list entry state change (such as update of the
list entry data contents).
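The effect of the list authority check can be sketched as follows. This is an illustrative model only: `SharedList`, `set_authority`, and `add_entry` are hypothetical names, and the sketch is single-threaded, whereas the CF performs the compare and the update atomically within one command.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SharedList:
    authority: int = 0                       # list authority value (LAU)
    entries: List[str] = field(default_factory=list)


def set_authority(lst: SharedList, expected: int, new: int) -> bool:
    """Compare the LAU with `expected`; replace it with `new` on a match."""
    if lst.authority != expected:
        return False
    lst.authority = new
    return True


def add_entry(lst: SharedList, data: str, expected_authority: int) -> bool:
    """Add an entry only if the caller's comparative LAU operand matches."""
    if lst.authority != expected_authority:
        return False          # command rejected: caller has a stale view
    lst.entries.append(data)
    return True


work_queue = SharedList(authority=1)
accepted_before = add_entry(work_queue, "request-1", expected_authority=1)

# The owner changes the list state; peers still holding the old value fail.
changed = set_authority(work_queue, expected=1, new=2)
rejected_stale = add_entry(work_queue, "request-2", expected_authority=1)
accepted_current = add_entry(work_queue, "request-3", expected_authority=2)
```

A peer that has not learned of the change cannot alter the list, which is the guaranteed-failure property described above.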
Refer to Figure 9 for the following discussion of list
notification. Programs can register interest in specific list headers
used as shared work queues or in-bound message queues at the
CF, for the purpose of being notified when a
monitored list becomes nonempty. This provides initiative to the
program to issue commands to retrieve list entries that have been
placed on the list. The program registers interest in monitoring a
specific list via a list structure command, register-list-monitor,
passing the list-notification vector index to be used to track interest
in that list, as indicated in Step 1 in Figure 9.
Figure 9
When an entry is added to the list causing it to go from an empty to
nonempty state, as at Step 2, the CF sends a list
notification command indicating an empty-to-nonempty list state
transition to registered users at Step 3. The list-notification
(LN) vector token (passed in on the initial attach
command when the program connected to the list structure) is provided
along with the LN vector index on the
list-notification command. The command is received by the coupling
support facility link microprocessor on the target system and the
specified list-notification vector bit and associated list vector
summary bits are updated to reflect the list-state-transition, as will
be described next.
Each list-notification vector has a local-summary
(LS) bit that indicates the overall contents of the
vector as either inactive (all vector bits are set to ones indicating
empty list state) or active (at least one bit is reset to zero
indicating nonempty list state). There is also one global summary
(GS) bit for the processing node; it indicates the
overall contents, either inactive (all vectors are inactive) or active
(at least one vector is active), for all of the list-notification
vectors at the node.
The coupling support facility first sets the specified
LN vector bit to the nonempty state. Then the local
summary bit for that vector is set to the active state. Finally, the
global summary bit for the node is set to the active state. Setting the
local summary and the global summary to the active state serves as the
means for the operating system to observe the fact that an
LN signal has been received; this is detected during
normal dispatcher processing when looking for new work units to
dispatch.
As with the cache buffer invalidation signal handling, there is no
processor interruption, processor cache disruption, or software task
context switch caused as a result of processing the list state
transition command.
The program steps in polling for list nonempty state transitions are
(1) test the global summary, then (2) test the local summary if
necessary, and finally (3) test individual vector bits to identify the
specific lists that have transitioned to a nonempty state.
The first test is made by the dispatcher routine of the operating
system; if no vectors are active, normal dispatcher processing
continues.
Tests of the summary bits use the TEST VECTOR
SUMMARY instruction. TEST VECTOR
ENTRIES examines bits in a list-notification vector.
Summary bits are placed in the inactive state using the SET
VECTOR SUMMARY instruction in response to observing that one
or more vectors have been placed into the active state during dispatcher
polling. First the global summary is reset. Then the local summary bit
is tested and reset if necessary. This is done by the operating system
prior to proceeding with testing of the state of individual list vector
entries, so as not to lose dispatching initiative for subsequent
list-notification events.
Once the operating system has determined that an LN
vector has experienced at least one empty-to-nonempty list state
transition, it proceeds to drive each target user's list transition
exit at Step 4. The user exit routine then executes the TEST
VECTOR ENTRIES instruction to determine which lists
have entered the nonempty state at Step 5.
Note that when the last entry on a CF list is
deleted, list-notification commands signaling a nonempty-to-empty-state
transition are sent to registered connected programs. The
GS and LS summary bits are not
altered as part of a nonempty-to-empty-state transition. The specified
LN vector bit is set to indicate the empty state of
the list at the CF.
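The interplay of vector bits and summary bits can be modeled with a short sketch. The classes and method names here are hypothetical stand-ins, not the CPU instructions themselves, and the sketch folds the dispatcher test and the user exit into one `poll` method for brevity.

```python
from typing import Dict, List

EMPTY, NONEMPTY = 1, 0   # vector bit values as described in the text


class NotificationVector:
    def __init__(self, size: int) -> None:
        self.bits: List[int] = [EMPTY] * size
        self.local_summary_active = False


class Node:
    def __init__(self) -> None:
        self.vectors: Dict[str, NotificationVector] = {}   # keyed by LN vector token
        self.global_summary_active = False

    def signal_nonempty(self, token: str, index: int) -> None:
        """Empty-to-nonempty: vector bit first, then LS, then GS."""
        vector = self.vectors[token]
        vector.bits[index] = NONEMPTY
        vector.local_summary_active = True
        self.global_summary_active = True

    def signal_empty(self, token: str, index: int) -> None:
        """Nonempty-to-empty: only the vector bit changes."""
        self.vectors[token].bits[index] = EMPTY

    def poll(self) -> Dict[str, List[int]]:
        """Dispatcher poll; returns nonempty list indexes per active vector."""
        if not self.global_summary_active:
            return {}
        # Reset summaries before testing entries so that a signal arriving
        # during the scan re-arms them and is seen on the next poll.
        self.global_summary_active = False
        found: Dict[str, List[int]] = {}
        for token, vector in self.vectors.items():
            if vector.local_summary_active:
                vector.local_summary_active = False
                found[token] = [i for i, bit in enumerate(vector.bits) if bit == NONEMPTY]
        return found


node = Node()
node.vectors["token-A"] = NotificationVector(size=4)
idle = node.poll()
node.signal_nonempty("token-A", 2)
first_poll = node.poll()
second_poll = node.poll()
```

The second poll finds nothing to do even though the vector bit still shows a nonempty list: the summary bits carry the dispatching initiative, while the vector bits mirror list state.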
Given a responsive operating system polling means, the above mechanism
avoids the undesired overhead of processor interruptions during
program execution and the corresponding cache disruption effects that
ensue at points in processing where the dispatcher is not intending to
preempt the CPU to dispatch another unit of work.
Summary of the CF architecture. From the functions previously
described, it can be seen that the CF provides a
rich and diverse set of capabilities upon which programs can build
efficient, reliable, and scalable protocols for sharing data in a
clustered system. Highlighted functions and design characteristics
include:
- Global concurrency controls and hardware-assisted lock contention
detection
- Global buffer coherency controls for distributed caches
- High-speed shared cache with CPU-synchronous access
- Shared queues for workload distribution and message passing
- Cross-invalidate signal delivery without processor interruption or
global broadcast required
- Local processor mirroring of global shared-resource state via local
state vectors
- Atomic CF command properties to minimize software
serialization requirements and simplify recovery processes
Coupling support facility architecture
This section outlines several aspects of the coupling support
facility architecture.
First, the SEND MESSAGE instruction is described.
The instruction is used to deliver a command request to a
CF from an attached processor node. Next, links
between a coupling support facility and a CF are
considered. These links carry command and response information, as well
as cross-invalidate and list-notification commands from the
CF. Finally, system fencing functions are described.
Command delivery. An exchange of command and response
information between a coupling support facility and a
CF is called a message operation. It is important to
distinguish this mechanism from message-passing protocols between
software programs on different nodes of a cluster or communication
flows in a networked environment. In the context of this discussion, a
message is the transport unit for exchanging commands and
responses with CF microcode over a high-speed link,
with an architecture designed expressly to support efficient
data-sharing functions across nodes of the Parallel Sysplex cluster.
When an operating system invokes the operation, the command information
is specified in main storage; it includes a command code, operands, and
output data for a write-to-CF command. Response
information is placed in main storage to summarize the results of
command execution and input data are also stored for a read
command.
The program issues SEND MESSAGE to start a message
operation (see Figure 10). The instruction designates a
message subchannel and a message-operation block in main storage. The
subchannel is associated with a specific CF and
identifies the links (there can be several) that may be used for the
operation. OS/390 activates as many message
subchannels as can be effectively used for parallel execution of
multiple CF commands.
Figure 10
After the coupling support facility selects a link for communication,
the operation is performed by sending the command to the
CF, transferring data as appropriate, decoding and
executing the command, formulating a response, and storing
response information in main storage. While executing the command, the
CF may send secondary commands to one or more
processing nodes.
The message-operation block. Figure 10 illustrates parameters
for this operation:
- Asynchronous (A)--When the A bit is one, the message operation is
performed asynchronously to continued instruction processing--the
SEND MESSAGE instruction is completed before the
command reaches the CF. Otherwise,
CPU instruction processing is delayed and the entire
operation is performed during the execution of SEND
MESSAGE.
- Notification (N)--When the N bit is one, the list-notification vector
bit designated by the notification descriptor is reset to signal the
completion of the operation.
- Message-command-block address and command length--These are the
main-storage locations of a coupling command and the number of bytes in
the command.
- Data buffer descriptors--These are the main-storage locations and sizes
of the data buffers used by the command. The aggregate data area can
contain up to 64 kilobytes (KB). The buffer contents
are sent to the CF when the write bit in the
message-command block is one; the CF returns data to
the buffers when the write bit is zero.
The message-command block. This contains information that is
sent to the CF:
- Command code--This specifies the command to be performed.
- Write (W)--When the W bit is one, a write operation is
performed--information is transferred from the data buffers to a
CF structure. Otherwise, a read operation is
performed--information is transferred from a CF
structure to the data buffers.
- Command information--These are operands that complete the command
specification.
The message-response block. This is the destination for
information that is returned by the CF. It starts at
the location immediately following byte 255 of the message-command
block. The following are stored in the block:
- Response count--This is the number of meaningful bytes stored in the
message-response block. The count spans information stored starting at
byte 0 of the block. The information includes the response count, the
data count, and the response field.
- Data count--This is the number of meaningful bytes stored in the
data buffers. The data count is zero when the write (W) bit in the
message-command block is one.
- Response--This is information summarizing the results of command
execution.
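The three blocks can be pictured as plain data structures. The following sketch is an assumption-laden illustration: the text does not give field widths or byte layouts, so the blocks are modeled as Python objects, and the command code value used in the example is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DATA_BYTES = 64 * 1024      # aggregate data area limit given in the text
RESPONSE_BLOCK_OFFSET = 256     # response block follows byte 255 of the command block


@dataclass
class MessageCommandBlock:
    command_code: int
    write: bool                 # W bit: True = buffers -> CF, False = CF -> buffers
    command_info: bytes = b""


@dataclass
class MessageResponseBlock:
    response_count: int
    data_count: int
    response: bytes


@dataclass
class MessageOperationBlock:
    asynchronous: bool          # A bit
    notify: bool                # N bit
    command: MessageCommandBlock
    data_buffers: List[bytearray] = field(default_factory=list)

    def total_data_bytes(self) -> int:
        return sum(len(buffer) for buffer in self.data_buffers)

    def validate(self) -> None:
        """Reject a request whose buffers exceed the 64 KB aggregate limit."""
        if self.total_data_bytes() > MAX_DATA_BYTES:
            raise ValueError("aggregate data area exceeds 64 KB")


def expected_data_count(command: MessageCommandBlock, bytes_returned: int) -> int:
    """The data count is zero for a write; otherwise it is the bytes returned."""
    return 0 if command.write else bytes_returned


read_page = MessageOperationBlock(
    asynchronous=False,
    notify=False,
    command=MessageCommandBlock(command_code=0x01, write=False),  # hypothetical code
    data_buffers=[bytearray(4096)],
)
read_page.validate()
```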
Asynchronous vs synchronous operation. In contrast to an
I/O operation with a disk or network device, which
takes many milliseconds to complete and is always performed
asynchronously to continued instruction processing, a coupling support
facility message operation is performed synchronously or asynchronously
to instruction processing, depending on the option selected by the
program.
A general guideline is to use synchronous operation for commands that
transfer at most 4 KB of data (not counting the
bytes in the message-command block). Most frequently used commands (for
example, locking commands), commands that enqueue or dequeue work
requests or messages, and commands that read or write 4
KB of data from or to a cache structure, satisfy the
guideline.
Commands that transfer more than 4 KB of data or are
otherwise known to be long-running should use the asynchronous option.
Other work can be processed while the command is being executed.
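The guideline reduces to a simple decision rule, sketched here with a hypothetical helper `use_asynchronous`. The `long_running` flag is an assumption that stands in for whatever knowledge the program has about a command's expected duration.

```python
SYNC_DATA_LIMIT = 4 * 1024   # 4 KB guideline from the text


def use_asynchronous(data_bytes: int, long_running: bool = False) -> bool:
    """Return True when the guideline suggests the asynchronous option."""
    return long_running or data_bytes > SYNC_DATA_LIMIT


lock_request = use_asynchronous(0)             # locking command, no data
page_read = use_asynchronous(4 * 1024)         # 4 KB cache read
large_write = use_asynchronous(32 * 1024)      # exceeds the guideline
```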
Completion of the message operation. No I/O
or other interruption is generated for a message operation. This design
reduces processor overhead. For example, an interruption at the end of
a disk operation normally stops the processing of a higher priority
task, invokes an interruption handler to save the machine state, causes
a lower priority work request to be placed on a system queue, results
in castouts from caches and translation-lookaside buffers, and restores
the old machine state to return to the interrupted task. This
disruption is avoided using the techniques described next.
When the program selects the synchronous option for a message
operation, control is returned at the end of the operation (end-op) of
the SEND MESSAGE instruction with the message
operation completed. The status of the operation is then determined
from the condition code of a TEST MESSAGE
instruction.
When the program selects the asynchronous option, it can designate a
list-notification vector bit that is to be reset when the operation is
completed. The operating system tests for completion when, in the
normal course of events, it is searching for a new unit of work to
dispatch.
Notification of asynchronous message completion. The coupling
support facility exploits list-notification local state vectors to
signal asynchronous message completions to the operating system. List
notification vectors were previously introduced. The operating system
establishes a separate completion vector for each CF
to which the processor is connected. Each bit in a given vector is
associated with a different message subchannel used for communication
with that CF. The operating system issues a
DEFINE VECTOR instruction to set up a
list-notification vector in protected processor storage. The coupling
support facility assigns a list-notification token to serve as the name
for the vector; the token is used in various CPU
instructions and coupling commands. A vector that indicates the
completion of message operations is shown in Figure 11.
Figure 11
Each list-notification vector has a local-summary
(LS) bit that indicates the overall contents of the
vector as either inactive (all vector bits are set to ones) or active
(at least one bit is reset to zero).
There is also one global summary (GS) for the
processing node; it indicates the overall contents, either inactive
(all vectors are inactive) or active (at least one vector is active),
for all of the list-notification vectors at the node.
The coupling support facility sets the local and global summary
bits to the active state after it resets a vector bit to indicate the
completion of a message operation.
The program steps in polling for the completion of asynchronous
operations are to test the global summary first, then test the local
summary if necessary, and finally test individual vector bits to
identify the completed operations. The first test is made by the
dispatcher routine for the operating system; if no vectors are active,
normal dispatcher processing continues.
Tests of the summary bits use the TEST VECTOR
SUMMARY instruction. The TEST VECTOR
ENTRIES instruction examines bits in a
list-notification vector.
List-notification vector bits are set using the SET
VECTOR ENTRY instruction; this is done by the
operating system as part of initiating an asynchronous SEND
MESSAGE operation. Summary bits are placed in the inactive
state using the SET VECTOR SUMMARY instruction in
response to observing that one or more vectors have been placed into
the active state during dispatcher polling. This is done by the
operating system prior to testing the individual vector bits for
completed operations, so as not to lose dispatching initiative. Once
the operating system has determined that one or more subchannels have
completed execution of a message operation, it proceeds to execute the
TEST MESSAGE instruction to observe status for those
requests.
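The completion protocol for one CF can be sketched from the point of view of its three participants. `CompletionTracker` and its methods are hypothetical names; the `in_flight` set is an assumption about how an operating system might remember which subchannels it is waiting on, and the global summary is held in the same object only to keep the sketch short.

```python
from typing import List, Set


class CompletionTracker:
    """Completion vector for one CF: one bit per message subchannel."""

    def __init__(self, subchannels: int) -> None:
        self.bits: List[int] = [1] * subchannels   # all ones: nothing completed
        self.local_summary_active = False
        self.global_summary_active = False         # node-wide in the real design
        self.in_flight: Set[int] = set()

    def start_async(self, subchannel: int) -> None:
        """Operating system side: set the bit, then issue the operation."""
        self.bits[subchannel] = 1
        self.in_flight.add(subchannel)

    def complete(self, subchannel: int) -> None:
        """Coupling support facility side: reset the bit, then the summaries."""
        self.bits[subchannel] = 0
        self.local_summary_active = True
        self.global_summary_active = True

    def poll(self) -> List[int]:
        """Dispatcher side: return subchannels whose operations have completed."""
        if not self.global_summary_active:
            return []
        # Summaries are reset before the bits are examined.
        self.global_summary_active = False
        self.local_summary_active = False
        done = sorted(s for s in self.in_flight if self.bits[s] == 0)
        self.in_flight.difference_update(done)
        return done   # the caller would now check status for each subchannel


tracker = CompletionTracker(subchannels=4)
tracker.start_async(0)
tracker.start_async(3)
nothing_yet = tracker.poll()
tracker.complete(3)
finished = tracker.poll()
still_waiting = sorted(tracker.in_flight)
```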
As discussed earlier for list notification, the above mechanism avoids
the undesired overhead of processor interrupts during program execution
and the corresponding cache disruption effects that ensue at points in
processing where the dispatcher is not intending to preempt the
CPU to dispatch another unit of work.
Links between the coupling support facility and the CF. A
connection between a coupling support facility and a
CF is called a CF link. The links
provide transfer rates of 100 megabytes per second with low access
latency.
Each link is arranged to provide two information flows. Information in
one flow is sent from the coupling support facility to the
CF. Information in the other flow is sent from the
CF to the coupling support facility. The information
in the flows need not be associated with the same coupling command.
A number of message operations may be executed concurrently on a single
link. The operations are split into short intervals of time
during which only a segment of information is transferred over the
link. The intervals are sequenced in response to demands made by the
coupling support facility and the CF.
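The interleaving of several message operations on one flow can be illustrated with a small scheduling sketch. The text says only that intervals are sequenced in response to demand; the round-robin order used here is an assumption chosen for simplicity, and `interleave` is a hypothetical helper.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def interleave(operations: Dict[str, bytes], segment_size: int) -> List[Tuple[str, bytes]]:
    """Split each operation into segments and alternate them on one flow."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    pending: Deque[Tuple[str, bytes]] = deque(operations.items())
    schedule: List[Tuple[str, bytes]] = []
    while pending:
        name, remaining = pending.popleft()
        schedule.append((name, remaining[:segment_size]))
        rest = remaining[segment_size:]
        if rest:
            pending.append((name, rest))   # requeue behind the other operations
    return schedule


schedule = interleave({"write": b"AAAAAA", "read": b"BB"}, segment_size=2)
```

In this example the short read command crosses the link after the first segment of the longer write, rather than waiting for the whole write to finish.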
Buffers at each end of the link contain areas for command information,
data, and response information. They are allocated for use on a dynamic
basis to compensate for speed mismatches among the link, the coupling
support facility, and the CF. Figure
12 shows an example of a CF link
with its two information flows. There are four buffers at each end of
the link. The figure suggests that one of the write commands, together
with the data to be written, has been sent from the coupling support
facility to the CF, which has not yet completed the
command. At the same time, one of the read commands has entered a
buffer at the coupling support facility, and has just started to cross
the link. Both commands were invoked by the operating system.
Figure 12
Independently of the commands sent by the operating system, the
CF has sent secondary list-notification
(LN) and cross-invalidate commands
(XI) to the coupling support facility as part of the
execution of coupling commands that were received from other processing
nodes (not shown). The LN command is being executed
by the coupling support facility link microprocessor. The
XI command is "in flight" over the link to the
coupling support facility.
A response for each command will be returned when command execution is
completed.
System fencing. Key to cluster availability is the means to
"failover" applications to a healthy node when the node on which
they are running is deemed to be failing. In order to recover resources
owned by the failing node, that node has to be reliably known to be in
a terminated state so that it can no longer access shared resources.
The CF and coupling support facility provide the
means to isolate a processor from accessing any shared resources in the
cluster (i.e., to "fence" it) so that cluster recovery can take
place.
As part of an availability failover protocol, each
OS/390 system periodically broadcasts "heartbeat
signals" to the other operating systems of the cluster. When signals
are missed, indicating that a system has probably failed, a peer system
(any of the remaining healthy nodes) assumes recovery responsibility
for any resources held by the failing system. However, this does not
guarantee that the faulty system is actually in a terminated state. It
could be in a temporarily hung state, or looping while disabled for
interruptions, for an excessive period of time. The recovery system
must cause the failing
system to become isolated from the cluster before it takes recovery
actions, which may include completing or backing out transactions for
the failing system and releasing its database locks. Then, the workload
of the failing system is distributed to other systems.
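The detection half of this protocol can be sketched as a check of heartbeat timestamps against a failure interval. The function name, the use of floating-point seconds, and the system names are all assumptions made for illustration.

```python
from typing import Dict, List


def suspected_failures(
    last_heartbeat: Dict[str, float],
    now: float,
    failure_interval: float,
) -> List[str]:
    """Return systems whose last heartbeat is older than the failure interval."""
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > failure_interval
    )


# Timestamps in seconds; SYS2 has been silent for 40 seconds.
heartbeats = {"SYS1": 98.0, "SYS2": 60.0, "SYS3": 99.5}
suspects = suspected_failures(heartbeats, now=100.0, failure_interval=25.0)
```

A system on this list is only suspected of failure. As the text explains, it must be isolated before any recovery action is taken on its behalf.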
Isolation from the cluster is achieved by establishing a channel
subsystem state to screen the I/O and message
operations of a processing node. The not-isolated state is set when the
node is initialized; when the state changes to isolated, any new
I/O or message operations initiated from the
isolated node are rejected by its channel subsystem.
Figure 13 shows an isolation scenario. First, an
operating system sends an activate-fencing command to initialize the
fencing function at its processing node. The command is sent by way of
a CF; it stores a nonzero fencing-authority value at
the node. The operating system also distributes the authority value to
peer systems in the cluster.
Figure 13
When heartbeats are missed for a period of time in excess of a
predetermined failure interval, another system in the cluster can take
action to partition the failing system from the sysplex. The recovery
system issues an isolate command via the SEND
MESSAGE instruction to interdict any I/O
and message operations attempted by the failing system. The command
specifies a fencing-authority test value; it is forwarded by the
CF to the coupling support facility at the failing
node as indicated on the isolate command.
The coupling support facility executes the isolate command, as follows.
When the fencing-authority value at the node is nonzero and matches the
fencing-authority test value, the channel-subsystem state is set to
isolated and an I/O-termination process is started.
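The authority check and the resulting screening of new operations can be modeled as follows. `ChannelSubsystem` and its methods are hypothetical names, and the authority values in the example are arbitrary; the sketch does not model the I/O-termination process.

```python
class ChannelSubsystem:
    def __init__(self) -> None:
        self.fencing_authority = 0   # zero: fencing not activated
        self.isolated = False        # not-isolated state at initialization

    def activate_fencing(self, authority: int) -> None:
        """Store the nonzero fencing-authority value at the node."""
        if authority == 0:
            raise ValueError("fencing authority must be nonzero")
        self.fencing_authority = authority

    def isolate(self, test_value: int) -> bool:
        """Enter the isolated state only if the authority check passes."""
        if self.fencing_authority != 0 and self.fencing_authority == test_value:
            self.isolated = True
            return True
        return False

    def start_operation(self) -> bool:
        """New I/O or message operations are rejected once isolated."""
        return not self.isolated


node = ChannelSubsystem()
node.activate_fencing(0x5A5A)
wrong_value_accepted = node.isolate(0x1111)
allowed_before = node.start_operation()
right_value_accepted = node.isolate(0x5A5A)
allowed_after = node.start_operation()
```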
A response to the isolate command indicates whether or not all active
I/O and message operations have ended; if they have
not, the termination process continues at the failing node and the
takeover system reissues the command until a response indicates that
all operations have ended. If all operations have not completed in a
reasonable time period, the recovery system can reissue the isolate
command, specifying that the I/O termination process
should terminate long-running I/O operations at a
channel control word (CCW) boundary. If a
program-determined period of time expires again without completion of
the isolation process, the recovery system reissues the isolate command
specifying immediate termination of any still-outstanding
I/O operations. This will terminate any apparently
hung CCW operations. In this manner, the system
isolation process is executed to allow quiescing of outstanding
I/O operations if possible so as to not leave shared
resources in an indeterminate state of completion. In addition, the
system isolation process resets the channel interfaces from the
target system so that any serialized state information maintained in
shared-disk controllers (such as device reserves) is released.
Once the response from the isolate command indicates that all
I/O operations have been completed or terminated,
the failing system has been isolated from the cluster. Resource
recovery and workload redistribution can proceed on other systems in
the Parallel Sysplex cluster.
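The escalation sequence followed by the recovery system can be sketched as a loop over increasingly forceful termination options. The names are hypothetical, attempt counts stand in for the program-determined time periods described above, and raising an error when the final option does not complete is an assumption of the sketch.

```python
from enum import Enum
from typing import Callable, List


class TerminationMode(Enum):
    QUIESCE = "let active operations end normally"
    CCW_BOUNDARY = "end long-running I/O at a CCW boundary"
    IMMEDIATE = "end outstanding I/O immediately"


def isolate_with_escalation(
    send_isolate: Callable[[TerminationMode], bool],
    attempts_per_mode: int,
) -> List[TerminationMode]:
    """Reissue the isolate command, escalating until all operations end.

    `send_isolate` returns True when the response reports that all active
    I/O and message operations have ended.
    """
    issued: List[TerminationMode] = []
    for mode in TerminationMode:          # iterates in definition order
        for _ in range(attempts_per_mode):
            issued.append(mode)
            if send_isolate(mode):
                return issued
    raise RuntimeError("isolation did not complete")


def hung_node(mode: TerminationMode) -> bool:
    """Hypothetical failing node whose hung operation ends only when forced."""
    return mode is TerminationMode.IMMEDIATE


commands = isolate_with_escalation(hung_node, attempts_per_mode=2)
```

Starting with the gentlest option gives outstanding operations a chance to quiesce, so that shared resources are not left in an indeterminate state unless there is no alternative.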
Summary of coupling support facility architecture. The
coupling support facility architecture provides a set of essential
functions in the Parallel Sysplex cluster. They are:
- Efficient command transport for communication with the
CF
- CPU-synchronous command delivery and execution
- Asynchronous command completion without I/O
interruption
- CPU instructions for manipulation of local state
vectors and local tracking of CF resource state to
minimize unnecessary signaling traffic between nodes
- System isolation functions to support robust failover protocols
Parallel Sysplex scalability
Figure 14 depicts effective total-system capacity
as a function of the number of physically configured
CPUs in a processing system. The line labeled
IDEAL shows a 1:1 correspondence between physical
capacity and effective capacity. That is, as a CPU
is added to the processing system, its full uniprocessor capacity would
be effectively applied to program execution. Real configurations, of
course, do not exhibit this ideal behavior.
Figure 14
The symmetric multiprocessor (SMP) line shows
behavior of an SMP as additional
CPUs are added to the same single physical system.
SMP systems provide maximum effective throughput at
relatively small numbers of engines, but as more
CPUs are added to the SMP system,
incremental effective capacity begins to diminish rapidly, limiting
ultimate scalability. This is attributable to the overheads associated
with interprocessor serialization, memory cross-invalidation, and
communication required in the hardware to support conceptual sequencing
of instructions across CPUs, cache coherency, and
serialized updates to storage performed atomically to
CPU instruction execution. These processes are
performed in the hardware without the benefit of knowledge of software
serialization that may already be held on storage being manipulated at
a much more coarse level. In addition, SMP overheads
are incurred in the system software due to software serialization
and communication to manage common system resources.
The S/390 Parallel Sysplex scalability
characteristics are excellent. Physical capacity introduced to the
configuration via the addition of more data-sharing systems in the
sysplex (where each system can be an SMP or
uniprocessor) provides near-linear effective capacity growth as well.
Performance studies conducted in a Parallel Sysplex environment
consisting of multiple IBM S/390 model 9672
CMOS systems running a 100 percent data-sharing
CICS database control facility
(CICS/DBCTL) workload demonstrated an incremental
overhead cost of less than half a percent for each system added to the
configuration. In addition, the initial data-sharing cost associated
with the transition from a single-system non-data-sharing
configuration to a two-node data-sharing configuration was measured at
less than 18 percent. [16]
These results testify to the excellent scalability of the
S/390 Parallel Sysplex. This topic is discussed in
detail in Reference 10.
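The measured costs can be turned into a rough capacity estimate. The following sketch assumes identical systems and assumes that the two overheads are additive and apply uniformly to every system in the configuration; that model is ours, not a result from the cited study. Because the measurements are upper bounds on cost, the function yields a lower bound on effective capacity.

```python
INITIAL_DATA_SHARING_COST = 0.18   # upper bound: one system -> two-node data sharing
INCREMENTAL_COST = 0.005           # upper bound: each additional system


def effective_capacity(systems: int) -> float:
    """Lower-bound effective capacity, in units of one non-data-sharing system."""
    if systems < 1:
        raise ValueError("at least one system is required")
    if systems == 1:
        return 1.0
    overhead = INITIAL_DATA_SHARING_COST + INCREMENTAL_COST * (systems - 2)
    return systems * (1.0 - overhead)


two_way = effective_capacity(2)     # 2 * (1 - 0.18)
eight_way = effective_capacity(8)   # 8 * (1 - 0.18 - 6 * 0.005)
```

Under these assumptions an eight-system configuration retains at least 79 percent of its physical capacity, which is the near-linear growth the text describes.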
Conclusion
Several key design characteristics unfold when considering
fundamental properties desired in an ideal large-scale server system
capable of handling both current and emerging commercial application
workloads. One important attribute is the ability to leverage the power
of multiple processors to meet the processing capacity demands of
business-critical workloads. This leads to the need to treat these
multiple processors as a single large-scale computing resource from
several perspectives. Clients of the multiprocessing server want to
view the server system as a single node in the network. Applications
should be able to be executed seamlessly across the multiprocessing
system, accessing processing resources from whichever
CPU the application logic happens to reside on.
Systems administrators need the ability to manage the multiprocessing
system from a single point of control. To maximize system throughput
and provide consistent response times to mission-critical applications,
it is desirable to be able to direct arriving work requests for
execution on any processor having available capacity in a highly
responsive and dynamic manner. If the processing compute demands grow
and exceed the capacity of the existing server system, it is desirable
to add an additional CPU to the existing server
system and grow the application workload transparently, without
requiring workload splitting of customer applications across
processors or repartitioning of databases to dedicate
portions of the database to individual processors of the large-scale
server system.
Fundamental to satisfying all of the desired design characteristics
outlined is the ability to share data and processing resources across
the CPUs of the large-scale server system, without
significantly impairing performance in support of resource sharing.
This further requires that the multiprocessing server system is
designed to provide low-latency, high-performance global serialization
controls across its set of CPUs, as well as provide
the mechanisms to have multiprocessor coherency controls so that shared
data can be cached simultaneously in local processor memory of multiple
CPUs with guaranteed coherency properties intact.
Within limits, the symmetric multiprocessor (SMP) is
the multiprocessing building block that has all of these design
characteristics and is in the marketplace today. It has been in
existence in various forms in the information technology industry for
25 years, having evolved considerably in terms of capability and
sophistication over that period of time.
Unfortunately, the SMP does have critical
limitations that have driven the industry to search for a better
technology answer. The two fundamental shortcomings of an
SMP are its limits in both scalability and
availability. As CPUs are added to the
SMP, incremental capacity diminishes rapidly beyond
a relatively small number of CPUs, due to
interprocessor communication in support of concurrency and coherency
controls as well as software-related resource management costs.
Further, the SMP represents a single point of
failure, not only from a hardware perspective, but more significantly
from a software view as it runs a single version of the operating
system and supported applications.
These shortcomings and the ever-increasing demand for additional
processing capacity and improved availability for commercial-processing
workloads continue to drive the need to scale capacity beyond the
limits of a physical SMP system and exploit multiple
system nodes for both scale and availability. This has led to the
emerging prominence of clustered systems composed of multiple
SMP or uniprocessor nodes. Clustered systems also
offer potential advantages in systems management economies-of-scale
given the relative homogeneity of systems within the cluster.
Typically, clustered systems provide high degrees of scalability by
partitioning workloads and related databases across the cluster nodes
to avoid the need for cross-node buffer coherency and serialization
controls, which can significantly compromise scalability beyond a
relatively small number of nodes if software-based message-passing
mechanisms are deployed to accomplish these functions. However, such
"shared-nothing" clustered system environments sacrifice key
desired characteristics of an ideal large-scale commercial server in
order to meet the scalability and availability objectives. Without
data-sharing capabilities characteristic of an SMP
server, it is not possible to dynamically balance work based on
processor capacity. Nor is it possible, for example, to add a node to
the cluster for additional capacity growth without having to split the
application or repartition databases, which require a cluster-wide
outage.
The S/390 Parallel Sysplex is an advanced commercial
processing clustered system, combining many attributes of an
SMP in terms of seamless access to multiprocessing
resources, with the scalability and continuous availability
characteristics of clusters. The Parallel Sysplex supports
high-performance multisystem read/write data sharing with local cache
coherency, enabling the aggregate capacity of multiple
OS/390 systems to be applied against common
workloads. This in turn facilitates dynamic workload balancing,
maximizes processor utilization, and provides consistent response
times. Further, through data sharing and dynamic workload balancing,
continuous availability and continuous operations
characteristics are improved for the clustered system, as nodes can be
dynamically removed or added to the cluster in a nondisruptive manner.
The Parallel Sysplex cluster technologies effectively address the
overhead issues typically associated with shared-data model
architectures, such as global serialization message-passing protocols,
global broadcast cross-invalidate cache coherency protocols, and
intersystem "ping" between systems and shared
I/O devices. The Parallel Sysplex cluster
technologies integrate a comprehensive shared-data architecture model
with specialized hardware-assists and optimized software protocols to
provide a highly scalable and robust commercial parallel-processing
platform.
Key technology functions provided include:
- Global concurrency controls and hardware-assisted lock contention
detection
- Global buffer coherency controls for distributed caches
- High-speed shared cache with CPU-synchronous access
- Shared queues for workload distribution and message passing
- Hardware-assisted system isolation for system failover recovery
The Parallel Sysplex cluster is an integral part of the
OS/390 platform and is the foundation on which a
growing number of new subsystem and operating system enhancements are
based. With the maturation of the technology and delivery of sysplex
exploitation by the traditional on-line transaction processing and
decision support workloads well underway, the Parallel Sysplex focus is
shifting to support new application environments, such as commercial
parallel Web-server applications, and cluster-enabled object business
servers to distributed clients.
The S/390 Parallel Sysplex cluster represents the
next step in the evolution of large-scale commercial-processing server
systems.
*Trademark or registered trademark of International Business
Machines Corporation.
Cited references
Accepted for publication December 20, 1996.