|
Synchronization is a fundamental function required
in nearly any aspect of a computer system. It has been studied in
various research communities including distributed database systems,
basic operating system principles, parallel processing systems, and
centralized database systems. In parallel processing the
synchronization techniques must be extremely efficient to allow
frequent synchronization of parallel application programs (e.g.,
barrier synchronization used in a parallel DO
loop [1]).
A survey of various techniques can be found in
Reference 2.
Concurrency control in database systems has also been a
well-studied problem in both centralized databases
[3] and
decentralized systems. [4]
Kohler reviews many techniques for
concurrency control such as locking, time stamps, circulating permits,
tickets, conflicts, and reservations. [4]
Although many of
these techniques are quite different, the fundamental goals of
high-performance locking services are similar. This paper describes a
locking facility designed for the S/390* Parallel
Sysplex*; this facility is general purpose in nature and has specific
implementations that support database concurrency control systems.
The objectives of this facility are high availability, scalability, and
high transaction throughput for workloads that are both
update-intensive (high write-to-read ratios) and read-intensive (low
write-to-read ratios). Although there are many emerging systems with
strong claims for availability, scale, and throughput (see
Reference 5
for a comprehensive and recent survey and analysis), these systems do
not generally support high write-to-read ratios unless the workloads
are first partitioned across systems. It is claimed that a
"shared-disk" function (that is, all disks are accessible from all
processors) combined with a high-speed locking facility are essential
functions. Note that additional functions are also required (e.g.,
high-speed cache-like functions and sharedmemory
functions), [6,7] but this paper focuses on
locking. [1]
This paper is organized as follows. The next section describes the
system model and design objectives, and is followed by a section on an
overview of the locking services. The last section justifies many of
the technical claims through performance modeling and measurements.
System model and objectives
Parallel processing is increasingly being used in commercial
systems. The limitations of symmetric multiprocessors
(SMPs) have long been noted, [8] with
practical limits on the order of ten processors. However, demands for
processor capacity greatly exceed this capability and the advent of
advanced coupling technology [9-13]
and commodity technology
(e.g., processors, interconnects, disks) is driving these systems into
acceptance in commercial markets. An overview of these systems, in
addition to covering several key technical issues, can be found in
Reference 5.
There are four key objectives to the S/390 Parallel
Sysplex, specifically related to the use of parallel technology. These
are:
- High availability
- Scalability
- High transaction throughput
- High write-to-read ratios in the workloads
Underlying the parallel architecture is a shared-disk
architecture. In Reference 5,
various topologies for connecting disks
to processors are discussed and it is observed that many shared-nothing
architecture (i.e., disks partitioned among the processors)
implementations are emerging. It is claimed that to achieve a high
write-to-read ratio, not only must a shared-disk architecture be used,
but in addition, advances must be made in critical functions to support
data coherency among processors. These functions, embodied in the
S/390 coupling facility, [6,7] contain
functions for buffer sharing and invalidation, shared queue structures,
and locking. Not only does this system support high write-to-read
ratios but it can also achieve higher levels of success with respect to
availability, scale, and throughput. There are also benefits in areas
such as workload management, systems management, and similar issues
that are further discussed in Reference 6.
The primary workload is high-throughput transaction-processing systems.
In addition, the workloads contain relatively low database contention.
Low response time is critical for high-throughput transaction
processing, and locking is a critical component. This can be
illustrated with a simple use of Little's Law
[14] that
states the multiprogramming level is a product of the arrival rate and
the response time. For example, assume an arrival rate of 100
transactions per second with an average response time of 0.5 seconds.
This would lead to a multiprogramming level of 50 transactions; that
is, on the average at any instant there would be 50 transactions at
some point in their execution in the system. The lock contention seen
by any given transaction is a function of the current locks held by
other transactions. Thus, the multiprogramming level has a major impact
on the lock contention. Now consider the transition from a database
running on one processor to one being accessed across multiple
processors. Obviously, the response time to obtain a lock will
increase. Using the above discussion, this effect would increase the
multiprogramming level, which could impact the lock contention. This
highlights the key performance objective, which is to provide a
high-speed locking facility that has a minimal impact in transaction
processing performance in a parallel system and has the additional
property that more processors can be added without further affecting
the lock response time (i.e., the algorithms are not a function of the
number of processors in the parallel server).
For historical perspective we also highlight the evolution of the
multisystem locking capabilities of clustered S/390
processors. Shared disks, initially introduced in 1969, provided
multisystem serialization controls by allowing one system to reserve
the device (that is, lock out other systems from accessing the disk)
for a period of time. The global resource serialization
(GRS) component provided shared or exclusive
named locks across multiple systems.
[15] This was used by
operating system components to serialize file access across multiple
systems. This gave the capability to concurrently share files on a
single disk drive. The resource lock manager was introduced to provide
record-level locking and buffer management for IMS*
(Information Management System) in a two-system
configuration. [16]
Finally, the coupling facility introduced
the locking model that is the basis of
this paper. [6] Table
1 summarizes this discussion.
Table 1 History of serialization techniques for the S/390, where
M = number of systems in the cluster
| Year | Type | Relative Performance | Granularity |
| 1969 | Device reserve | I/O operation | Disk drive |
| 1980 | GRS | M - I/O operations | Files, general purpose |
| 1981 | IRLM | 2 - I/O operations | Database records |
| 1992 | Coupling facility | Context switch | General purpose |
S/390 Parallel Sysplex and the locking model
The system consists of multiple operating systems with a
shared-disk architecture. This is referred to as the
S/390 Parallel Sysplex. The base operating system
contains a rich set of services to support the clustering of these
systems. These include membership services, high-speed signaling, and
shared-disk support.
[6,17]
The system, as shown in
Figure
1, consists of up to 32 processing nodes (each of which
can be up to a 10-way SMP) connected to shared
disks. Each SMP runs a single copy of the
OS/390* operating system. For the remainder of this
paper, we use the term system to describe one of the nodes.
The sysplex timer serves as a synchronizing time reference source for
systems in the sysplex, so that local processor time stamps can be
relied upon for consistency with respect to time stamps obtained on
other systems. The coupling facility (CF) is a key
Parallel Sysplex technology component providing multisystem
data-sharing functions. An overview of the various functionalities is
described in Reference 6. It is
important to point out that the
CF is implemented using an S/390
processor with special links to the other systems.
Figure 1
This paper focuses on the locking model. The complete details of this
model are beyond the scope of a single paper; our objective is to
provide an overview of the architecture. In order to simplify the
presentation of the model, the architecture is described from the
perspective of three layers. We first describe the basic underlying
architecture and the functions within the coupling facility. Next we
describe the additional services and functions that are added by the
operating system. Finally we describe the view from the perspective of
a user of the operating system services. This is a unique aspect of the
scheme--that various database lock managers can tailor their use to
create a multisystem lock manager with unique locking semantics.
Level 1: The locking architecture. This section describes the
underlying locking architecture. This primarily describes functions
within the CF. Nevertheless, the functions described
are actually part of both the CF and the operating
system.
Conceptual model. Viewed in isolation, this aspect can be
described as a centralized lock manager that provides simple shared
(SHR) and exclusive (EXC)
semantics. It does not provide queuing. A user can operate against a
named lock table. Multiple tables are supported where each table
contains N lockable entries (0 ... N - 1).
Table 2 shows the semantics of this model.
Table 2 Locking semantics from the Level 1 perspective
| Current State | Request | Outcome |
| Free | SHR | Granted in share mode |
| Free | EXC | Granted in exclusive mode |
| SHR | SHR | Granted |
| SHR | EXC | Granted with warning about share holders |
| EXC | SHR | Rejected but told who is the owner |
| EXC | EXC | Rejected but told who is the owner |
A key aspect of this system that sets it apart from conventional
approaches is that the requests to the locking facility are done
synchronously with respect to the processors that execute the request.
That is, when a lock request is made, the requesting processor
logically stalls until the request is completed. This approach is taken
because the performance characteristics of the locking service make it
technically feasible. The issue of the performance impact on the design
is discussed in more detail in a later section of this paper. This
means that a context switch is not needed, thus avoiding issues such as
the overhead of suspending the requestor and the complexity of
asynchronous locking protocols.
Physical model. Figure 2 shows the key
structures within a locking table. It contains N entries and
each entry has information to track the exclusive or shared
state. For each entry, the first byte is used to contain the system
identifier for exclusive or globally managed locks (these terms are
explained in detail shortly). The second field in the entry is a bit
vector with one bit for each possible system that may have interest for
this entry. These bits represent the interest of a particular lock
manager on a particular system. The concept of individual lock managers
is clarified in the next two sections.
Figure 2
Figure 3 shows the overall structure of the multisystem
locking model. There are up to 32 nodes connected to one or more
coupling facilities. Each system may have multiple links to the
coupling facility for both availability and performance reasons. The
performance of the link is critical to achievement of the synchronous
behavior and is explored in further detail in a later section. There
can be multiple coupling facilities, again for both performance and
availability reasons. And finally, within each coupling facility there
can be one or more named lock tables.
Figure 3
Level 2: The operating system. This section describes the
operating system support for the model. The focus is a simplified
description of the locking capabilities. The actual capabilities go far
beyond the simple shared and exclusive model described here. The
complete set of services in the product is defined in
Reference 10 and
an overview is provided in the next section on system-level components.
Conceptual model. The model presented in Level 1 would have
limited use in any real system. This section discusses the enhancements
in the operating system to provide a richer set of locking semantics
while exploiting the basic services provided in the Level 1 model.
The operating system provides the capability to name a lock request.
The name consists of a character string, to be referred to as the lock
name, and an integer, to be referred to as the hash class value. In
addition, queuing of conflicts and additional semantics to support
multiple lock states (this aspect will be discussed in the Level 3
description) are also provided. The conceptual model has aspects of
both a distributed lock manager and a centralized lock manager. The set
of operating system services that support this model is referred to as
the system lock manager (SLM). The operating system
views the locking architecture as a high-speed lock-contention
detector. By exploiting the locking hardware, the operating system is
actually able to operate in two distinct modes. When there is no
contention, the Level 2 model is that of a high-performing centralized
lock manager. When contention exists, the Level 2 model is a
distributed lock manager that uses signaling protocols to resolve
contention and perform notifications.
Physical model. The implementation of these services is very
complex and a complete description is beyond the scope of this paper.
Instead, several sample lock requests are used to illustrate the key
functions provided. The actual programming interfaces contain much more
than described here. Figure 4 shows the overall model
with respect to this level. Lock queues have been added at each system
and there is a signaling protocol used to resolve lock conflicts. First
we want to highlight where lock state information is stored. The
coupling facility contains the state of each lock table entry, while
each processor contains additional state information.
Figure 4
The notation used in these scenarios is first described.
Figure
5 shows three systems connected to a coupling facility.
The examples used all refer to lock table entries i and k. The
lock table entries are used through a concept called hash
classes, which are described in subsequent sections. The relative
point is that lock names are mapped to hash classes and all lock
managers that are using a common lock table must also use a common
hashing algorithm.
Figure 5
Within the lock table entry the exclusive field (the column
"EXC") will refer to the system that currently
has exclusive control of that lock entry. The share string is a bit
mask that positionally refers to a system that has shared interest in
the lock entry (if set to "1"). For example, a share string of
"011" means that systems 2 and 3 have shared interest. Within each
system four distinct areas of state and queue information are shown.
These are used to clarify the examples and do not necessarily indicate
how it was implemented. The first area, local lock state, is
where each system "remembers" its own view of the state of each
lock entry. This is either no knowledge (0), shared knowledge (S),
exclusive knowledge (E), or that some other system is an exclusive
holder (Gx).
The second area, local queues, shows the software queues
that are maintained when a lock is granted without the involvement of
other systems. This shows the lock name and the hash class value (e.g.,
"A,i"), and the current requestor, state, and ownership (e.g.,
"P
(Own, EXC)"
means that Process
P
owns the lock in exclusive
mode). The third and fourth areas show the same type of information as
in the local queues but with a different perspective. Requests are
moved into the third area when another system becomes the global
manager (i.e., a single system provides overall management of the hash
class). That is, this area shows the view of this system of
a global queue. Requests are moved to the fourth area when this system
becomes the global manager of the queues. Therefore, the locks in this
area represent the complete global state.
The first example (see Figure 6) shows a simple request
from Process
P
for a lock named A in
hash class i. Upon receiving this request the local lock state is
checked for hash class i, and since it has no knowledge of the state it
makes a lock request for hash class i in exclusive state (shown with
action 1 in Figure 6). The coupling facility
receives this request and
since there are no prior requests for this hash class, it sets the
exclusive entry with that of System 1 and returns a positive response.
System 1 then grants the request and sets the appropriate
information in the local queues. It also sets an indicator in the local
lock state and records the full lock name in the local queue area.
Process P
is able to obtain a global lock
without interacting with any other systems.
Figure 6
Figure 7 shows the resulting state after Process
P
on System 2 obtains a lock named C in hash
class k. This also results in a single trip to the coupling facility
where the share entry is now positionally set for System 2. The request
is then recorded in the local queues and granted.
Figure 7
The next example (Figure 8) shows the case where a
process makes a request that is compatible at the hash class level. The
key point in this example is that the global lock is obtained without
even having to visit the coupling facility.
Figure 8 shows an example
of Process P
on System 1 making a request for a
lock named B in hash class i. This time when the local lock state is
checked the system determines that it already has exclusive interest in
the hash class so it does not need to make a coupling facility access.
It can simply check the local queues and determine that the request is
for a different lock than what is currently held. The request can then
be granted.
Figure 8
Figure 9 illustrates how requests "bump into" one
another at the hash class level from multiple systems but are still
compatible. The key point in this example is that the global lock is
obtained without having to exchange lock names even when there is some
collision at the coupling facility. Figure 9
shows an example of
Process P
on System 3 making a request for a
lock named D in hash class k. The local lock state indicates that the
system has no interest in the hash class so a request is made to the
coupling facility. This time the coupling facility sees that System 1
also has a shared interest in the request but since the current request
is compatible (shared) it sets the share entry for System 3 and returns
a positive response. System 3 is never made aware of the other
"sharers" and the request can then be granted.
Figure 9
The previous example (Figure 9)
illustrated the case of a hashing
scheme mapping multiple lock names to the same hash class. A variation
on that example is now shown--when two different lock names map
to the same hash class but they are incompatible at the hash class
level. Figure 10 shows an example of Process
P
on System 1 making a request for a lock named
E in hash class k. The local lock state indicates that the system has
no interest in the hash class so a request is made to the coupling
facility. This time the coupling facility realizes that several systems
have a shared interest in the hash class. The coupling facility sets
the exclusive entry and returns the shared string to System 1. System 1
has now accepted responsibility to sort out the global state of this
hash class. System 1 then begins a process called escalation
in which a global queue for the hash class must be built. It first
parses the shared string to determine that Systems 2 and 3 have
interest in the hash class and signals those systems to return their
lock information. There are two important points here. First, these
signals are done in parallel over a high-speed signaling facility.
Second, only the systems that have a current interest in the hash class
are signaled. This is an important aspect of the scalability of the
design--if there were 32 systems in this example configuration only two
systems would be interrupted for lock information. Once System 1
receives the local information from the other systems, it builds a
global picture and realizes that there is no contention for any lock
name, just for a hash class. This situation is called false
contention and the process
P can be
granted the lock. The example also shows a movement of the queue
information on Systems 2 and 3 to the local portion of global queues.
This is to illustrate that the process now has a responsibility to
communicate with the global manager on future state transitions (e.g.,
unlocks). Also note that the local lock state was changed from S to
G
indicating System 1 is the global manager for
this hash class.
Figure 10
Our final example (Figure 11)
looks at real contention. Figure
11 shows an example of System 2 making a request for
Process P
for a lock named A in hash class i.
The local lock state indicates that the system has no interest in the
hash class so a request is made to the coupling facility. This time the
coupling facility realizes that System 1 has exclusive interest in the
hash class. The coupling facility returns an indicator that System 1 is
the exclusive owner of the hash class. When System 1 receives the
message from System 2 it then builds a set of global queues. Since it
was the exclusive owner it does not have to signal other systems. Once
the queues are built it determines there is real contention. There are
elaborate facilities for handling contention that are described in the
next section.
Figure 11
Level 3: System level components. This section completes the
description of the locking services by highlighting how a system level
component (e.g., a database manager) could use these services. The goal
of this section is to illustrate the flexibility of the model and not
provide a comprehensive description of all capabilities. The model has
been used by several database managers (e.g., IMS,
DB2*, and Virtual Storage Access Method, or
VSAM) and details of their experiences can be found
in related papers.
Although the Level 2 model is a general-purpose lock manager, many
database lock managers have much richer locking semantics such as
multistate support (e.g., more than just the two basic lock states;
shared and exclusive). Other functions such as lock promotion or
demotion (i.e., changing the state of a currently held lock) are
critical for the overall performance in a clustered environment. Using
the semantics of the Level 2 model, the Level 3 model can be tailored
to support virtually any lock model. For example, the lock states
supported by many database systems are far more complex than the simple
shared or exclusive state. Using the following features this can be
accomplished.
This section describes some of the key capabilities that can be
constructed using the Level 1 and Level 2 models previously described.
Since the paper describes a general set of functions that could
be constructed by any lock manager, we use the term tailored lock
manager (TLM) to represent its name. A
TLM can be specific to any environment such as a
database manager or a shared file system. The key point is that the
TLM must be able to optimize its performance based
on its unique environment. The functions in the lower levels provide
these building blocks. These points are illustrated by presenting a
partial list that highlights some of the unique capabilities of this
model.
Lock names and hash classes. In general, database managers
(DBMs) lock on names that are meaningful in their
particular data structure. For example, the IMS
system uses a lock with 19 bytes (or 152 bits) that is representative
of the data in the IMS hierarchical data structure.
The DBM can thus optimize the locking structure
around the data structure. Obviously the number of locks held at any
given point in time is an extremely small fraction of the
2 
locks that are possible. In order to provide for
efficient contention detection between systems, a hashing algorithm is
employed to map the lock names into a hash class table. As long as the
tables are of a size that is many times larger than the number of locks
held, any false contention on a hash class is kept small. An
initialization process is used by the DBM and the
TLM to allocate resources in the system lock manager
(SLM) and the coupling facility to provide for the
appropriate contention detection. At initialization time, the first
TLM that is started calculates an appropriate size
for the coupling facility hash tables (out of the total storage made
available for lock tables and record tables by the customer policy) and
requests the SLM to allocate structures for the hash
tables using a predefined name of a sharing group. Subsequent
TLMs that are started share the same data join in
the use of the facilities allocated by the first TLM
by using the same sharing group name.
The TLM can also specify a 64-byte "user-data"
parameter with the lock requests. One use of this information is
to contain the lock states when the TLM lock
protocol supports states other than just share and exclusive.
Contention detection. A lock can be held with one of several
ownership privileges. Ownership can be granted when the privileges of
the holders (if any) are compatible with the privileges needed by a new
requestor. For example, a lock may be requested with either share or
exclusive privilege. Contention is detected when a share privilege is
requested and a lock holder has exclusive privilege, or when an
exclusive privilege is requested and there are existing holders.
Contention and notify exits. The SLM
communicates to the TLMs via a mechanism called
exits in OS/390. [18] Two of
these exits, the contention and notify, are the means by which the
TLMs resolve contention for shared resources. In
these exits, the 64-bytes of user data can be inspected or modified.
One key value is that complex lock protocols can be implemented using
this structure.
Waiter queuing. When the lock manager cannot grant a lock
because of contention, the SLM preserves a record of
the request on a list of waiters. The rules for processing
waiter queues vary among each pair of DBMs and
TLMs. The queuing is done within the
SLM; however, the rules for lock compatibility are
done by the TLM using the exits mentioned
previously.
Availability and recovery recording. In order to meet the
demanding continuous availability requirements of many of today's
large commercial transaction systems, it is important to allow
processing to continue with full integrity of the database while
handling recovery from a hardware or software failure. Since the
coupling facility is electronically and logically isolated from the
systems that are running the
DBM/TLM/SLM
software, it provides the necessary availability for recovery from a
system or software failure. The coupling facility structure provides
both a locking function and a recovery recording function. Modify
lock names (exclusive locks that are used to update database
records) are recorded in the list elements in the coupling facility
recovery tables. A list is assigned to each instance of the system lock
manager participating in the global managed locking protocols. The
coupling facility uses the user identification (UID)
that specifies the particular system lock manager to access the
appropriate list. Atomic operations that manipulate lock table entries
and record data elements are provided in the coupling facility
architecture. These operations support commands for creating, reading,
replacing, and deleting elements. By providing atomic operations, the
architecture ensures that the lock structure and the recovery structure
are always consistent. Note that in the event of a failure of the
coupling facility, no data are lost since all the information in the
coupling facility is replicated in the set of
TLM/SLMs. Since the architecture
supports multiple coupling facilities, a new structure can be allocated
in another facility and the
TLM/SLMs recreate the coupling
facility contents.
Contention resolution. Let us now examine the process used by
the TLM/SLM for the resolution of
contention. Although the key points are highlighted here, the reader is
referred to Reference 9 for a
complete description. In order to
determine the existence of contention, the coupling facility and
the SLM use the lock compatibility matrix shown in
Table 2.
Any time the requested state is compatible with the existing
state, the request is granted and the lock is locally managed by the
holder(s). When the requested state is incompatible with the existing
state, then the lock becomes globally managed by a chosen
TLM/SLM combination, and
this request along with future requests are processed by the global
manager (TLM/SLM). The chosen
SLM does not manage the contention, but rather
maintains a queue of holders and requestors for the
TLM to use to manage the contention. When contention
is detected, the chosen SLM passes the request queue
to the TLM by using the contention exit mentioned
earlier. The user data information plays an integral role in enabling
user-defined lock protocols. At this point the TLM
must manage the contention by the use of one of the following actions:
- Grant a pending request, possibly with a different state
than that requested. This will cause the requestor to resume, or will
control its completion exit if appropriate.
- Deny a pending request. This will also cause the requestor to
resume, or its completion exit to be taken. However, in this case the
requestor is also told of the rejection and given any data provided by
the denying TLM. The user data can be modified when
the request is denied. Since this is presented to the requestor of the
denied request, the user data can be used to communicate the reason for
denial.
- Regrant a held request with a different state than it was
originally granted (e.g., demotion of a lock that is held exclusive to
shared). In this case the holder's completion exit is initiated,
informing the holder of the change in state. In addition, the user data
can be modified on a regrant.
- Inform a current resource owner that contention exists for a
resource it owns. This is done by passing the notice (through the
SLM) to the holding TLM by use of
its notify (notification) exit.
- Leave a request pending. The request will be left in the
request queue maintained by the SLM. The
TLM may grant the request on a subsequent invocation
of the contention exit (e.g., when the unlock of the current owner is
presented to the contention exit).
This implementation of
TLM/SLM exits and commands
provides a powerful method for implementing lock negotiation protocols
in the TLM while allowing the system
(SLM) to maintain an awareness of the
contention, and the state of the holder/requestor queue. In turn,
this allows the TLM to implement a process whereby
users (DBMs) initially obtain high-level locks, and
then negotiate them downward to finer levels of granularity as
contention occurs. This process of negotiating and notifying continues
iteratively until the TLM determines that no more
notifies are required and the state of the holder/requestor queue is
correct. The global management of the lock continues with the selected
TLM/SLM until contention no
longer exists. Note that due to the flow of requests, including unlock
requests, it is possible that the global manager TLM
does not have any interest in the lock it is managing.
Locking performance
The performance of the locking function is a critical component of
a parallel transaction processing system with shared data. Much effort
went into ensuring that the overhead of locking was as small as
possible. There were many components of this effort. Several of the
most critical are described in this section. It should be noted that
this is not a reflection of a completed body of work, but represents
work in progress. As most complex designs evolve, the performance
bottlenecks become better known and their solutions often affect the
original design accordingly. Locking in a Parallel Sysplex is no
exception.
Synchronous versus asynchronous locking. Accesses to the
coupling facility for obtaining locks require a communication outside
of the processor. In that way it is similar to an
I/O operation. However, the overhead of an
I/O operation is well known to be detrimental to
both system throughput and transaction response time. Much effort has
been expended on nonparallel transaction systems to reduce the number
of I/Os by extensive use of main storage and
expanded storage to hold I/O buffers. Such data
buffering allows records to be accessed without the need to drive an
I/O operation and suspend the transaction. This
reduces the path length of the transaction and significantly improves
the transaction response time.
It is a design goal in the Parallel Sysplex to minimize the additional
unavoidable system overhead introduced by locking without elongating
the transaction response time. This is accomplished by (1) building
protocols on the coupling facility links that optimize the
communications for low latency rather than high bandwidth, and (2)
defining the architectural interface to allow for locking commands to
be delivered to the coupling facility and responses received within a
single CPU instruction. The processor spins in an
idle loop in the microcode from the time the command is sent on the
coupling facility link to the time when the response is returned and
stored in main storage. This is referred to as synchronous
command execution. The alternative design is to perform a context
switch, i.e., the process thread is suspended from execution and a new
process thread is dispatched. This is referred to as
asynchronous command execution.
It is important to validate that this choice is optimal for both design
goals: minimizing system overhead and minimizing transaction response
time. Access times to the coupling facility are measured in the range
of 50 to 500 microseconds, depending on the operation and the speed of
the processor, which is significantly better than the several
milliseconds that an I/O operation requires. So from
the point of view of transaction response time, the synchronous design
is clearly superior.
However, what is less clear is whether the design maximizes system
throughput. Since the alternative asynchronous design is to suspend the
thread and perform a task switch, success of the synchronous design is
determined by comparisons against the overhead of this approach. If the
round-trip access time to the coupling facility takes less
CPU time than the cost of performing a task switch,
then it is deemed to be the correct choice.
The cost of a task switch has two components. The first is the path
length that is required to suspend the work unit thread, to dispatch a
new thread, to field and process the resulting interrupt on the back
end of the operation, and finally, to redispatch the original work unit
thread. A value of 2000 instructions is used for this path length. The
second component is the reduced performance of the processor caches
that results from the context switch and the subsequent purging of the
working set for the thread from the hardware caches. Much work
has been done to estimate this overhead for various processors, and a
value of 4000-5000 instructions is used for this effect. So the total
overhead of a task switch is estimated at roughly 6000-7000
instructions.
Since the overhead of the synchronous access to the coupling facility
is easiest to measure in units of time, it is useful to convert the
asynchronous penalty accordingly. The processor speed is clearly a
relevant factor since it is more costly to allow a faster processor to
wait than a slower processor. The processors under consideration have a
MIPS (million instructions per second) rate that
ranges from 20 to 50 MIPS. Using these numbers, one
obtains a break-even point for the synchronous operation in the range
of 6000/50 = 120 microseconds to 7000/50 = 140 microseconds
for the faster 50 MIPS processor and in the range of
300 to 350 microseconds on the slower 20 MIPS
processor.
The synchronous time for a coupling facility access has several
components that can affect the overall time. These are referred to as
elongators. One critical factor was already described,
namely the link protocol. To minimize this impact, special links were
developed that were point-to-point links and had minimal handshaking.
In fact, the entire link operation consists of a single transfer of the
command block to the coupling facility and a single transfer of the
response block back to the originating processor.
A second key elongator is distance. Most of the negative effects of
distance are eliminated by the link protocol that eliminates end-to-end
handshaking. However, distance still remains a factor and elongates an
operation by about 12 microseconds per kilometer. Since most distances
were originally expected to be within a machine room or, at worst, up
to the limits of fiber-optic channels (about 3 kilometers), it was not
viewed that distance would be a serious concern. However, it is an
intrinsic problem and no amount of additional design will remove it.
A third elongator is the path length needed in the software lock
managers that is required to initiate the operation and process its
completion. This is needed in both the synchronous and asynchronous
designs, but it is still critically important to minimize these path
lengths. Several different efforts have been devoted to minimizing
these paths and that work is ongoing.
A fourth elongator is the complexity of the processing that is
performed by the coupling facility. In this regard, the final design is
somewhat of a trade-off. It had been an original goal to define the
locking architecture so that a hardware implementation would be
possible. But the benefits of the additional functional capability
provided by a microcoded implementation moved us away from that goal.
However, significant effort was spent in making the command definitions
simple and streamlined. One key aspect was the decision to put queuing
and waiter notification in the SLM component and
minimize the function provided by the coupling facility to simple
contention detection and minimal logging for recovery.
The measurements show that a synchronous locking operation on a 50
MIPS processor connected to an equivalent speed
coupling facility completes in about 80 microseconds; this is well
below the 120-140 microsecond break-even point. Similarly, a
synchronous locking operation on a 20 MIPS processor
connected to a 20 MIPS coupling facility completes
in about 150 to 200 microseconds, again below the break-even point.
Where the design is not optimal is when a 50 MIPS
processor is connected to a 20 MIPS coupling
facility. Then the transfer time is still in the 150-200 microsecond
range since the components of time are mostly
determined by delays outside of the processor. However, this is no
longer better than the overhead of a task switch. For this reason and
for similar considerations when data transfers are included, an
asynchronous interrupt has been designed (not in the current product).
Its use would be managed by a heuristic routine that monitors the
synchronous delay and switches modes accordingly. This will be
especially useful in addressing the concerns of distance in a sysplex
where several factors, such as increased fiber-optic capabilities and
the move toward remote site recovery, are changing some of our original
views on distance limitations.
Lock contention. Contention occurs when a lock request for a
resource appears to be incompatible with its current lock state. This
is a statement that is relative to the given level of locking (as
defined in the section on locking models), where each level may have a
different view of contention and may act on it in a different manner.
At the architectural level, contention is detected on hash classes of
locks. Contention, when detected, is reported to the
SLM via information stored in the response block.
The SLM attempts to resolve the contention by
communicating with its other instances that share information on the
hash class and comparing the actual lock name with the set of lock
names for the current owners and waiters. It may turn out
that the lock name that is requested does not match any of these other
names. In this case, contention does not exist at this level and the
lock can be safely granted. This situation is referred to as
false contention, i.e., contention detected at the
architectural level that turns out not to exist at the
SLM level, which resolves the contention.
It may turn out that the SLM finds one or more
matching names in the list of names registered for the hash class. This
is referred to as real contention, although that is actually
a misnomer. The SLM reports the real contention to
the contention exit of the TLM. The
TLM makes the final determination of the contention.
If the TLM can decide the contention is real, then
the requestor is queued. Alternatively, the TLM can
grant some requests that appear (such as from a share/exclusive model)
in contention. It may turn out that the lock name is covering too large
an area of granularity in the database, and finer granularity locks are
required. This results in renegotiating all the relevant locks to a
lower granularity, where actual contention may not exist. This
hierarchical approach to locking is used to avoid getting locks on
areas of the database that are relatively inactive, or tend to be
accessed mostly by a single system. This design can significantly
reduce the overall locking rate and is a key method for gaining
performance by adjusting the locking rate to match the level of
contention in the system.
When the granularity of locking is at its finest level, real and actual
contention coincide. Workload traces indicate that this level of
contention is extremely small in most transaction systems, and it is
generally believed to be the case that significantly less than 0.5
percent of all lock requests experience actual contention.
False contention is another matter. False contention is a function of
the size of the lock table (as a proportion of the number of active
locks in the system), and the hashing algorithm. The size of the lock
table can be reconfigured dynamically to make it larger, so there is
some amount of tuning that can be done by the system programmer to
minimize the occurrence of false contention. The hashing algorithm is
more difficult to manipulate and much work has gone into developing
uniform hashing functions. The general rule of thumb is that the total
contention in the system should be no more than 1.0 percent. If the
locking design is good, real contention can be managed to extremely
small levels. And if the hashing function is relatively uniform, then
false contention can be managed by controlling the size of the lock
table. This is the strategy that is employed.
Consider the example cited in the section on system model and
objectives, where a transaction rate of 100 transactions per second and
response time of .5 second yields a multiprogramming level of 50.
Assuming 20 locks held per active transaction, one can predict 1000
current locks held at any time. If the false contention is to be
maintained at under 0.5 percent, then the lock table should have this
percentage of nonzero entries at any given time. In other words, the
table should be 99.5 percent empty on the average. Thus, the number of
hash classes defined should be 200 times the number of locks held. So
the lock table should be defined with 200,000 entries.
Unlock operations. Special considerations are given to the
handling of unlock operations. In particular, two architecture
extensions were developed to improve on unlock performance. One is a
special interface that allows the operation to proceed asynchronously
without the need for an interrupt, and the second is a command that
allows a list of lock table entries to be sent to the coupling facility
in a single operation.
Asynchronous unlocks. This section describes a capability that
is not in the current product but is relevant to an overall
understanding of the system design. An unlock operation to the coupling
facility is a command that resets an exclusive field or a share bit in
a lock table entry to zero. This creates a window in the state of the
lock table entry where the system performing the reset views the lock
as free, but another system may be viewing the lock table entry just
prior to its being reset. Thus an escalation signal may be received by
the original owning system after the lock is released. This window is
unavoidable and is handled by the SLM chase
protocol. [19]
A consequence of the chase protocol is that it is no longer a
requirement to perform the unlock operation synchronously. The window
exists whether the operation is performed synchronously or
asynchronously to the CPU. If the unlock operation
should fail without updating the lock table entry, then the mismatch in
state is detected by the next requesting system and recovered by the
chase protocol.
The locking architecture exploits this by providing for the
specification of an asynchronous option that causes the
CPU to release control and return to the program as
soon as the command has been transmitted on the link and before the
response is returned. At the conclusion of the command, the response
block is stored in the main storage by the channel, but no interrupt is
generated. At this point, the architectural interface is returned to
the idle state and new operations can be initiated. So, even though the
response is stored, it is not acted upon by the operating system. This
is referred to as the no-response protocol. This allows the overhead of
the operation to be reduced to the minimum start-up penalty
on the issuing CPU and is on the order of 200-300
processor cycles (or about 20-30 microseconds).
This protocol is used selectively by the operating system and is
generally limited to isolated unlock operations (nonlist form) that do
not have any associated record table updates.
Batched unlocks. Most transaction systems obtain locks during
the execution of a transaction as they are needed and released
collectively when the transaction is committed.
[20] So while
locking operations occur as individual events, unlocks tend to occur in
a batch. It is thus reasonable to consider allowing the unlocks to be
batched in a single operation to the coupling facility, and this
function is provided in the architecture. This allows the overhead
required to initiate the operation and the transmission time on the
link to be apportioned across the set of lock names in the list of
unlock operations.
It turns out, however, that the multilevel design approach to locking
makes this a very complex function, and it was added very late in the
development cycle. The problem is that the list must be handled at each
level, and the state of the locks, as viewed by each lock-manager
level, is different. The simplest level is the TLM
level, which simply builds a list of lock names with the associated
hash-class values that are held in either the shared or exclusive
state. The SLM, however, must parse this into two
lists: one list of those hash classes that have undergone escalation
and are managed, not in the coupling facility, but by some instance of
an SLM, and a second list that this
SLM views as managed by the coupling facility.
The first list is actually processed individually by the
SLM, and no performance gain is realized. The
assumption of low contention generally makes this a short list, but the
complexity of handling it must, nevertheless, be incorporated in the
design.
The second list is bundled in a command and sent to the coupling
facility. This is where the performance benefit is realized. Batch
sizes vary, but it is not unusual for a transaction to obtain 10-20
locks and release these all at commit time. So, significant savings can
be achieved with this scheme.
Considerations for multiple lock managers. A key attribute of
transaction processing systems that has already been articulated is
that most locks are generally obtained and released within a
transaction. So, the lock hold times are reasonably short (less than a
second or so). As the example above shows, the number of concurrent
locks held in the system can be estimated as the product of the average
number of locks per transaction and the multiprogramming level. Both of
these can be measured by standard system performance monitors and, from
these, the lock table size can be effectively calculated in the manner
shown above. Subsequent monitoring of the false contention rate allows
the size to be adjusted and tuned to fit the workload.
In order for this process to be successful, the class of locks mapped
to the lock table must be limited to the set of transaction locks
associated with a particular TLM. Otherwise there is
too much unpredictability in the mix of locks. Unfortunately, there are
several multisystem locking components in S/390
systems. Aside from the various transaction managers,
IMS, DB2, and
CICS* (Customer Information Control System), there
are system-locking components such as global resource serialization
(GRS). In the case of GRS, the
locks are often obtained and held for very long
durations. [21] Since the TLM components
have control of the hashing algorithms, special locks, such as the
allocation locks with very long hold times, can be separated into
unique hash classes. These hash classes would be, essentially,
permanently managed by the SLM.
However, in order to remain effective, this separation must be
maintained all the way to the coupling facility. This is accomplished
in the architecture by providing named lock tables and allocation
functions that allow for the creation of multiple lock tables on a
single coupling facility with unique attributes. These unique
attributes include: the presence or absence of share bits, the number
of share bits, the existence of a record table, and of course, the size
of the lock table and record table. But most importantly, the existence
of multiple named tables allows for the separation of locks into
distinct name spaces so that unique management, such as that described
here for minimizing false contention, can be provided by the
TLMs.
Conditional lock requests. Low contention rates are a basic
design premise. Real contention in transaction processing systems is
known to be quite small (under 0.5 percent of lock requests), and false
contention can be managed to similar levels by increasing the size of
the lock table. However, contention at some measurable level does exist
in the system. When it occurs, the overhead can be quite high;
signaling is required from the requesting system to the system managing
the entry. Also, if this is the first indication of contention, the
managing system changes the state of the entry to global management.
This escalation process can be quite lengthy and requires considerable
path length and signaling between systems. There is a corresponding
penalty when contention is removed and the entry is deescalated
from global management.
Conditional lock requests were included in the
TLM/SLM interface to address this
problem. If the lock is requested conditionally and the request to the
coupling facility indicates that the lock entry is not available, the
SLM returns control to the TLM
and takes no further action on the request. In particular, no attempt
is made to signal the managing system and the possible global
escalation is avoided. Subsequently, following a short delay, the
TLM can resubmit the lock request. Assuming the lock
hold time is short and assuming the probability of new contention
occurring is no greater than normal contention, the second request may
succeed at the coupling facility. If this is the case, then the global
management overhead in the SLM is avoided. The
TLM can always abandon this optimistic protocol
after several retries and submit the request unconditionally. Use of
the conditional protocol by TLMs has resulted in
significant reductions in system overhead.
Conclusion
This paper begins with the premise that shared-disk architectures
are better than shared-nothing architectures for clustered systems.
Specific benefits include workload balancing and effective utilization
of the processors, availability, and scalability. It is also argued
that in order to support update-intensive workloads that are often
found in database environments, the shared-disk architecture must be
augmented with special functions to improve the effectiveness of the
sharing of information across the cluster. One such function, locking,
was the focus of this paper. This paper described a high-speed locking
function for use in a parallel operating system environment and
provided a detailed description of the architecture and many critical
design trade-offs. These facilities are embodied in the
S/390 Parallel Sysplex. The facilities are oriented
toward transaction processing systems that are update-intensive and
require low response times. The system has many unique features, such
as the ability to obtain cluster-wide locks using synchronous
protocols, the ability to construct a lock manager that supports a
user-defined lock protocol (i.e., richer than a traditional
share/exclusive protocol), special features for availability, and rigid
performance objectives.
Acknowledgments
The authors would like to acknowledge the large number of people,
across many IBM divisions, who worked on the
S/390 Parallel Sysplex effort.
*Trademark or registered trademark of International Business
Machines Corporation.
Cited references and notes
Accepted for publication January 8, 1997.
|