DOI: 10.1147/sj.472.0207
Real-time Linux in real time
by D. Hart, J. Stultz, and T. Ts’o
Many applications require an operating system (OS) that provides guarantees on maximum latency, that is, the delay between an event taking place and the initiation, by the OS, of a high-priority application task which had been waiting for that event. For these applications, missing a latency deadline results in consequences that could range from the violation of a service level agreement to loss of life. To avoid such consequences, real-time applications turn to a real-time operating system (RTOS) to ensure that latency requirements are met.
IBM recently shipped WebSphere* Real Time (WRT) version 1.0, a real-time Java** virtual machine (JVM) utilizing Metronome, a real-time garbage collector created by the IBM Research Division. WRT 1.0 uses a real-time Linux** OS that was prepared by a relatively small development team at the IBM Linux Technology Center (LTC) using open-source software development methodologies and lightweight project management processes. This team was able to prepare a real-time Linux kernel suitable for use in enterprise systems a full year before it could have been made available from the leading Linux distributors.
In this paper, we first describe the characteristics of the real-time system which was required to support WRT and its lead customer, Raytheon Company. Next, we describe the development techniques that were used to take alpha-level software patches from the open source community and create a production-quality offering. Finally, we discuss the technical features that comprise the IBM real-time Linux extensions.
In this section, we discuss the definition of enterprise real-time OSs and the environment in which they have emerged.
Historically, real-time systems have been categorized as hard or soft. Hard real-time systems are those whose guarantees of low latency (typically in the millisecond to microsecond range) can be mathematically proved. These are generally small, single-processor embedded systems, because the locking and multithreading in multiprocessing OSs are too complex to permit maximum code path analysis. Very often, these embedded systems had significantly slower CPU speeds than the desktop or server systems of their time, and the main applications for these systems were data collection or immediate control loops, such as an application to keep an aerodynamically unstable fighter aircraft flying straight and level.
Similarly, the networking code used to implement TCP/IP (Transmission Control Protocol/Internet Protocol) is generally considered too complex to be analyzed using maximum code path techniques. For this reason, traditional RTOSs either lacked TCP/IP support entirely, or relegated TCP networking to a non-real-time portion of the system which did not offer latency guarantees.
In contrast to hard real-time systems, all other systems purporting to provide real-time services are categorized as soft real-time systems. These systems have included traditional legacy UNIX** systems, which implemented priority scheduling interfaces but contained few or no core OS kernel changes to provide latency guarantees. Hence, soft real-time systems have often been considered to be lacking any real-time characteristics.
The artificial dichotomy between hard and soft real-time systems has always been problematic. The characterization real-time is in fact a multidimensional quality-of-service continuum covering a wide variety of maximum latency time guarantees, differing levels of assurance that these guarantees are met, and differing levels of functionality under which latency guarantees are provided.
Given modern hardware and software architectures and design patterns, the terms have become even more inadequate of late. One reason for this is the increasing speed of computers. Thirty years ago, a Cray-1 supercomputer could perform 160 million floating-point operations per second and had 8 megabytes of memory. Today, that supercomputer would be considered a modest embedded system, not even capable of running such applications as the Mozilla Firefox** Web browser or the OpenOffice.org Productivity Suite**. In addition, in the last three years, increasing numbers of processor cores have been placed on a single chip.1,2 Thus, not only are clock speeds much faster compared to five or 10 years ago, but symmetric multiprocessors (SMPs) are now often found even in the smallest systems.
In addition, the ubiquity of networking as an integral part of computers participating in an interconnected world has meant that customers now assume TCP/IP support as a matter of course. The rise of service-oriented architecture (SOA) has led customers to demand real-time middleware services using DDS (Data Distribution Service),3 CORBA**,4 AMQP (Advanced Message Queuing Protocol),5 JMS (Java Message Service), and other messaging frameworks. Even simple HTTP (Hypertext Transfer Protocol) interactions may benefit greatly from low-latency guarantees, especially given the multiple levels of latency introduced by tiered architectures and Web applications that cross Web site boundaries.
Finally, there has been an increasing desire to use high-level programming languages, such as Java, even in embedded environments that had previously been dominated by assembly and C language programming. There are three reasons for this: first, with faster processors, the need to minimize the number of cycles per instruction in order to boost performance is not as critical; second, the need to implement ever-increasingly complex systems has led to the development of higher-level languages; and finally, with hardware costs decreasing, development time and expenses are now the most important contributors to the overall cost of a computing system.
In light of these changes, it is useful to define a new term, enterprise real-time OSs, to describe real-time systems that avoid the hard-versus-soft real-time system dichotomy. Enterprise real-time OSs have the following characteristics:
- Support for SMP systems
- Inclusion of networking technologies such as TCP/IP or InfiniBand**
- Support for commercially available middleware, such as databases or Web servers (perhaps running only as background, non-real-time tasks)
- Support for higher-level programming languages.
In addition, these OSs may or may not include mathematical proofs of maximum latencies. This characteristic may prove troubling to those who need to create life-critical or mission-critical systems. However, the reality is that modern systems are so complex that mathematical proofs of maximum latency are intractable and have been abandoned even in most so-called hard RTOSs. In addition, latency “misses” can also be caused by hardware failures—in the real world, hardware perfection can never be guaranteed. Hence, designers of systems that cannot tolerate a single point of failure are well advised to use high-availability systems that can compensate for latency misses, whether caused by application or OS bugs, hardware failures, or catastrophic attacks on a portion of the data center.
In our real-time enhanced Linux, while running a high-stress background test workload that saturated the CPU with multiple compiling jobs in parallel and disk and file system benchmarks, and while sending and receiving large amounts of network traffic, we were able to measure scheduler latencies of less than ten microseconds, with the maximum outliers at 51 microseconds—a very respectable result compared to many RTOSs on the market. Our system is nevertheless entirely compatible with the Red Hat Enterprise Linux6 user space, with full networking functionality, unlike other RTOSs on the market. (User space, or process address space, refers to the virtual address space of a process. Each process has its own virtual address space that is protected against access by other processes.)
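As an illustration of the kind of measurement described above, the following minimal Python sketch sleeps until successive deadlines on the monotonic clock and records how late each wake-up is. It is not the test tooling used to obtain the figures reported here; the function name and its parameters are illustrative assumptions, and an interpreted program at default priority will report far larger latencies than a real-time task on a real-time kernel.

```python
import time

def measure_wakeup_latency(interval_ns=1_000_000, samples=200):
    """Return the lateness, in nanoseconds, of each periodic wake-up."""
    latencies = []
    deadline = time.monotonic_ns() + interval_ns
    for _ in range(samples):
        remaining = deadline - time.monotonic_ns()
        if remaining > 0:
            time.sleep(remaining / 1e9)
        # Lateness: how long after the requested deadline we actually ran.
        latencies.append(time.monotonic_ns() - deadline)
        deadline += interval_ns
    return latencies
```

Calling `measure_wakeup_latency()` while a background stress load is running, and then inspecting the minimum, average, and maximum of the returned list, gives a rough picture of typical latency and of outliers.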
Real-time Linux not only brought significant changes for the enterprise, but also had a significant impact on the internal processes and role of the Linux Technology Center (LTC) within IBM. The project had a very rapid timeline and was directly customer-facing. It was not only designed to be an actual product, but was also based on an experimental branch of the Linux kernel source; both of these were unprecedented for the LTC.
Typically, teams in the LTC work closely with Linux distribution partners. The LTC team concentrates primarily on support and stabilization of the distribution on IBM hardware, and on features required by IBM software offerings. This approach allows IBM to ensure that the OS will meet customers' needs, and allows IBM and the distribution partners to manage customer support issues collaboratively.
While a good deal of effort was spent on stabilization and IBM hardware support for real-time Linux, the primary customer requirements addressed system-deterministic behavior and low latency. The requirements of the real-time Linux project came from an early-adopter customer, Raytheon,7 and the IBM Java Technology Center (JTC), which would be providing the real-time Java implementation that would be used for application development. As no enterprise real-time Linux distributions existed at the time, there was no third-party product to which changes could be made; the LTC real-time team would have to create a product of its own.
From the time the project requirements were known and the team was assembled, there were only nine months before the general availability (GA) date of the product, with several intermediate releases planned (see Figure 1). In addition to the already compressed schedule, the entire real-time stack was being developed in parallel, leading to an ever-changing requirements landscape and interlocking checkpoint deadlines. The customer was actually writing applications while the JTC developed the JVM and the LTC continued development of the underlying OS. With a significant number of feature requirements, stabilization, and release testing to be done in such a dynamic environment, some lightweight tools and processes were clearly needed. To minimize overhead, these tools would need to leverage the existing LTC infrastructure as much as possible.
Figure 1
Because the developers would be working closely with the Linux community, care was taken to select individuals who were familiar with the community's distributed development model, often referred to as patch culture. Geographical location was not a major factor, because much of the collaboration would be done by use of public mailing lists. A geographically distributed development team was assembled that comprised enthusiastic open source developers. In many ways, the team was itself a miniature open-source community.
As with any active project, a means was needed for archived broadcast communication. Using the LTC mailing list tools, we created the requisite user, development, and announcement lists typical of open projects. Mirroring the development methodologies of the Linux kernel community, before changes are committed to the source repository, the patch (along with a detailed description, justification, and “signed-off-by” line) must be sent to the development mailing list.8 Before this patch can be committed (i.e., made official), it must be reviewed and acknowledged by at least one member of the team. For example, the following text describes and justifies a patch:
This patch, which avoids deadlocks that were possible with the old periodic hook fix, collects cycles every interrupt into a cycle bucket which is then used when accumulating time. Thus, only the clock read is done in interrupt context and we don't have to worry about system hangs.
thanks
-john
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Darren Hart <dvhltc@us.ibm.com>
During the final phase of a release cycle, critical fixes must be reviewed and acknowledged by at least two individuals, one of whom must be the release owner or the team leader. This process ensures that team members review each other's code, while at the same time recording a history of changes and responsibility and satisfying the submission requirements for Linux patches.
While we were eager to make the real-time technology robust in order to give our customers early access to an enterprise real-time Linux OS, neither we, nor our customers, wanted to carry out-of-tree patches (i.e., those which are maintained independently of the main Linux kernel source tree) for very long. Such patches are difficult to maintain, as the Linux kernel source is rapidly changing.9 Our goal was to incorporate, back to the mainline sources, any changes we made to the kernel or system libraries in support of real-time operations. For this to work well, it was critical to foster a positive relationship with the open source community from the beginning, sharing our tests, results, and fixes as they emerged.
Up to this point, the real-time Linux community comprised a few interested parties communicating primarily on the Linux kernel mailing list.10 In an effort to increase awareness and exposure of real-time Linux, we worked with leaders within the community to build a public real-time Linux wiki11 and create a real-time users mailing list.12 These dedicated resources host tutorials, test cases, and related information as well as provide a forum for users, potential users, and potential developers to ask questions about real-time Linux.
Rigid, state-based tools were inappropriate for adaptation to the rapidly changing needs of the real-time project. Tools were required which were flexible and capable of growth in complexity along with the project. Nevertheless, it was necessary to leverage existing infrastructure and tooling in order not to lose valuable development time in setting up and administering custom tools.
The internal LTC wiki, based on the MoinMoin13 wiki engine, was being used for various purposes throughout the LTC, but only a few teams had employed it as their primary repository of information. We embraced the freeform nature of the wiki, and built our own project definition, status reporting, and resource structure on top of it. Because the wiki could be edited by everyone on the team, progress was not delayed while information was distributed among the team leaders. This open development model served as a guide for our other processes and tools; each developer needed to have easy access to project requirements, tasks, and status, and the ability to update them without funneling the information through a single person. The freeform nature of the wiki can lead to stale information and a site that is difficult to navigate if left unchecked. We were therefore careful to review, prune, and reorganize the wiki periodically.
While the wiki proved exceptionally useful for documenting project and individual status, there remained certain tasks that required more structured input and change histories. For bugs and atomic features (i.e., very specific features that are largely self-contained), we took advantage of the existing LTC Bugzilla** bug-tracking system. When the existing Bugzilla fields did not quite meet our needs, we simply added labels to the titles; these labels could be queried just like any other field. While this was merely a convention, and required discipline from the individual team members, it was less costly than trying to modify the LTC-wide Bugzilla to suit our particular project, or trying to administer our own Bugzilla system.
In summary, to meet the ever-changing needs of our fast-paced project, we selected highly flexible tools and defined structure and policy conventions to address our specific needs.
After nine months of intense effort, we successfully delivered the GA release of the IBM real-time Linux extensions. Nearly a year after the GA, we have delivered a service release and several intermediate releases to address customers' specific needs. By this time, the market has started to catch up, with both Red Hat and Novell, Inc., working on enterprise real-time Linux products. As real-time Linux becomes more mainstream, our team's working model has started a gradual migration back toward that of a typical LTC team. We have begun the transition from delivering directly to customers, to working with the enterprise distributions. Additional scrutiny from upper management and general interest throughout IBM, as evidenced by increased mailing list subscriptions, forced us to update our tooling to provide more readily accessible status information for various projects. With such a geographically distributed team, some form of asynchronous status reporting was needed, as disparate time zones made conference calls impractical. Fortunately, the existing tools were up to the task. Several custom Bugzilla queries that were accessible through links from the wiki, in addition to a slightly more formal project structure, have provided the needed detail, without the need for excessive formal meetings, conference calls, or more rigid tooling.
For the past two years, Ingo Molnar's real-time patch set has served as a staging ground for developers involved in introducing real-time functionality into the Linux kernel. Some of this work is experimental: features are tested in the tree and may fail. However, the goal of the patch set is not to develop a real-time fork (i.e., a split of a project's source code into two projects with differing goals, generally a last resort) or an out-of-tree product, but to provide a test bed for real-time functionality prior to its introduction into the Linux kernel. Some patches rapidly move from the tree to the mainline kernel; others are slowly refined before this move.
A good number of features recently included in the mainline Linux kernel began in the real-time tree. It should be emphasized that Ingo Molnar, Thomas Gleixner, and many other members of the community deserve our thanks and credit for the work outlined in the following subsections.
One of the first large segments of the real-time kernel to move to the mainline kernel was the conversion of semaphores (counters used to limit the number of concurrent accesses to a resource14) to mutexes (mutual exclusion locks ensuring singular access to a resource15). This overhaul laid the groundwork for many of the synchronization features in the real-time kernel. The larger Linux community is very protective of its established user base, so any new features are carefully reviewed for any negative impact. Because real-time functionality often implies giving up some performance in exchange for determinism, the Linux community has been wary of real-time-related changes. Thus, it was necessary to demonstrate that real-time changes also benefit non-real-time users. For example, one motivation for moving from semaphores to mutexes in the mainline kernel was code optimization and a reduction in data size, resulting in an increase in performance and scalability. Also, the simpler mutex usage model, which limits how mutexes can be used in comparison to semaphores, results in code that is easier to understand and facilitates better debugging tools.
With the mutex infrastructure in place, the next major feature to be moved to the mainline kernel was real-time mutexes. Real-time mutexes introduce the concept of priority inheritance to mutexes; this is important for determinism in code paths that require serialization. Priority inheritance in the kernel has historically been a contentious issue; however, by use of the fast-user mutex (also known as futex) interface, real-time mutexes can be used to provide priority-inheritance-enabled synchronization primitives at the user level.16 This approach of providing desired user-level features with little impact on kernel space proved successful in causing past objections to priority inheritance to be dropped, allowing real-time mutexes to be included.
The real-time mutexes provide even more benefits in the real-time kernel. Latency can be greatly reduced by redefining almost all spinlocks in the kernel (i.e., locking mechanisms used for very short access to resources where the waiting code “spins” in a busy loop until the lock is free1) to instead use real-time mutexes for serialization. This is probably one of the most significant changes in the set of real-time patches, contradicting the assumptions of code written using spinlocks. Spinlocks by definition do not sleep, whereas mutexes do. For example, traditionally, when a process holds a spinlock, preemption is disabled, so it is essential that it does not sleep. If a higher-priority process tries to acquire the lock, the process will spin, attempting to acquire the lock, blocking the lower-priority process from ever waking up. However, mutexes allow processes requesting locks to sleep, so when the higher-priority process tries to acquire the lock, it will sleep and wait for the lower-priority process to wake up and release the lock.
Priority inheritance helps further, by boosting the lock holder's priority to that of the highest-priority waiting process. Thus the benefit of using real-time mutexes in the place of spinlocks is that any low-priority process can be preempted in favor of a high-priority process, even if the low-priority process is in the kernel holding locks. This is safe because should the high-priority process need a lock that a preempted or blocked lower-priority process is holding, the priority inheritance code will boost the lower-priority process, letting it run until it releases the lock, at which point its priority will be restored and the high-priority process will continue.
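The boosting behavior described above can be sketched as a toy model. This is a simplified illustration, not the kernel's real-time mutex code: the `Task` and `PiMutex` classes are hypothetical, a higher number means a higher priority, and restoring the owner directly to its base priority assumes that it holds no other contended locks.

```python
class Task:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority   # priority assigned by the application
        self.priority = priority        # effective priority (may be boosted)

class PiMutex:
    """Toy model of a priority-inheriting mutex."""
    def __init__(self):
        self.owner = None
        self.waiters = []

    def lock(self, task):
        """Return True if acquired; otherwise queue the task and boost the owner."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        # Priority inheritance: the owner runs at the highest waiter priority.
        self.owner.priority = max(self.owner.priority, task.priority)
        return False

    def unlock(self):
        """Restore the owner's priority and hand the lock to the top waiter."""
        previous = self.owner
        previous.priority = previous.base_priority
        self.owner = None
        if self.waiters:
            new_owner = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(new_owner)
            self.owner = new_owner
            # The new owner inherits from any tasks still waiting.
            for waiter in self.waiters:
                new_owner.priority = max(new_owner.priority, waiter.priority)
        return previous
```

In this model, a task of priority 10 that holds the lock runs at priority 90 for as long as a priority-90 task is waiting, and drops back to 10 when it releases the lock.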
Because not all code paths are safe for preemption, there are still traditional spinlocks (called raw locks) that can be used when needed. The use of these locks thus reduces the non-preemptable code paths to only those few auditable paths where raw locks are used.
To further reduce latency, all interrupt handlers with the exception of the timer interrupt have been converted to kernel threads.17 When an interrupt triggers, instead of calling the handler immediately, the kernel schedules the kernel thread responsible for that interrupt and returns. This greatly reduces the amount of time an interrupt can preempt high-priority processes. As a schedulable entity, the kernel thread competes with other processes and threads for CPU time, and the interrupt thread priority levels and CPU masks (i.e., the list of CPUs on which the thread is permitted to run) can be specified from the user space. This feature, however, gives real-time processes more control over the system than most developers are accustomed to. As interrupt threads have a scheduling priority, they can be preempted by higher-priority processes. It is quite simple to write a real-time process that can “starve” interrupt processing, causing critical devices, such as a SCSI disk holding the root file system, to “time out” (i.e., reach their expiration time) or fail.
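The starvation hazard described above can be illustrated with a small simulation of a strict fixed-priority scheduler on a single CPU. This is a sketch under stated assumptions: the function, the task names, and the priority values are hypothetical and do not correspond to any kernel interface.

```python
def run_fixed_priority(tasks, ticks):
    """Simulate a strict fixed-priority scheduler on one CPU.

    tasks maps a name to (priority, work), where work is the number of
    ticks the task wants to run and None means it never blocks.
    Returns the number of ticks of CPU time each task received.
    """
    remaining = {name: work for name, (_, work) in tasks.items()}
    received = {name: 0 for name in tasks}
    for _ in range(ticks):
        runnable = [n for n in tasks
                    if remaining[n] is None or remaining[n] > 0]
        if not runnable:
            break
        # The highest-priority runnable task always wins the CPU.
        current = max(runnable, key=lambda n: tasks[n][0])
        received[current] += 1
        if remaining[current] is not None:
            remaining[current] -= 1
    return received
```

With an interrupt thread at priority 50 that needs five ticks and a never-blocking task at priority 90, the interrupt thread receives no CPU time at all; lowering the runaway task to priority 40 lets the interrupt thread finish first.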
High-resolution timers
Another feature included in the mainline kernel is high-resolution timers (HRTs).18 HRTs require more invasive changes than most features. It was necessary to introduce a new timer infrastructure, called hrtimers. These hrtimers are different from the preexisting timers (also called timeouts), as timeouts were optimized, in an apparent paradox, for events that do not normally occur. For instance, consider the select() system call, which waits either for input on a file descriptor or for a specified timeout to occur, and then returns. In the kernel, there are a number of these error condition timeouts, for functions like block I/O, networking, etc. These error paths are rarely time critical, so the timeout values are somewhat approximate and are not affected by small delays. The timeout infrastructure was very good for adding timers and removing non-expired timers quickly, but it was not optimized for actually firing timers.
The hrtimers infrastructure focuses on timers that need to fire with very fine precision. These timers are therefore set to expire using a new time structure, ktime_t, which efficiently stores time with nanosecond resolution. Actually firing timers at such a precise resolution required another logical change, namely dynamic programming of the interrupt sources, also called clockevents. Prior to the use of clockevents, timer ticks were used, which fired at a fixed interval of one to ten milliseconds (depending on the configured scheduler tick frequency). This coarse interrupt interval prevented hrtimers from firing with anything approaching nanosecond precision, so the clockevents code was created to allow timer interrupts to be dynamically reprogrammed for the next hrtimer. Each time an interrupt fires, the hrtimer list is inspected and the timer is reprogrammed to fire when the next hrtimer is set to expire, which might be ten microseconds away. The precision of hrtimers is limited only by the precision of the available clockevent drivers. Since different machines can have different timer hardware, the clockevent drivers allow for a wide range of timer devices in an architecture-neutral fashion.
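The difference between a periodic tick and a dynamically reprogrammed event source can be sketched as follows. This is an idealized model, not the kernel implementation: the function names are hypothetical, a binary heap stands in for the kernel's timer queue, and the event device is assumed to interrupt exactly at the programmed instant.

```python
import heapq

def fire_times_periodic(expiries_ns, tick_ns):
    """With a periodic tick, a timer fires at the first tick at or after its expiry."""
    return [-(-e // tick_ns) * tick_ns for e in sorted(expiries_ns)]

def fire_times_oneshot(expiries_ns):
    """With a reprogrammable event device, each interrupt is set for the next expiry."""
    queue = list(expiries_ns)
    heapq.heapify(queue)
    fired = []
    while queue:
        now = queue[0]                    # program the device for the earliest timer
        while queue and queue[0] <= now:  # on interrupt, run every timer that is due
            heapq.heappop(queue)
            fired.append(now)
    return fired
```

For timers due at 10 microseconds and 2.5 milliseconds, a one-millisecond periodic tick fires them at 1 and 3 milliseconds, whereas the one-shot model fires each at its requested expiry.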
Building further on hrtimers and clockevents, the architecture-neutral dyntick infrastructure was created to allow the processor to stay for extended periods in a low-power state.19 Prior to the use of clockevents, the timer interrupt was used both for firing timeout timers and as a scheduler tick, allowing the scheduler to check whether the time-slice of the current process had expired.
However, with timer interrupts being dynamically programmed, there is no scheduler tick. The scheduler was reworked to be called from an hrtimer itself. This allowed the scheduler tick to be slowed down as needed simply by altering the frequency of the hrtimer used to trigger it.
When the system is idle, the processor can enter low-power states and save energy. However, without dynticks, the processor has to wake up every one to ten milliseconds for a timer interrupt, thus limiting the amount of power saving possible. With dynticks, if the system is idle and there are no hrtimers about to expire, the processor can stay in the low-power state for as long as possible. Further, if a process is running and there are no hrtimers firing, the dynticks code will not interrupt user-space processes to perform unnecessary scheduling work, further reducing scheduling overhead.
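A rough count shows why this matters for power saving. The following sketch compares how often an idle processor is woken under each scheme; the function names are hypothetical, and wake-ups from sources other than timers are ignored.

```python
def idle_wakeups_periodic(idle_ns, tick_ns):
    """Timer interrupts that wake an idle CPU when the tick is periodic."""
    return idle_ns // tick_ns

def idle_wakeups_dyntick(idle_ns, pending_expiries_ns):
    """With dynamic ticks, only timers that expire during the idle period wake the CPU."""
    # Timers sharing an expiry instant are served by a single interrupt.
    return len({e for e in pending_expiries_ns if 0 < e <= idle_ns})
```

Over one second of idle time, a one-millisecond periodic tick causes 1000 wake-ups, whereas with dynamic ticks and two timers pending in that second there are only two.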
Finally, Ingo Molnar's Completely Fair Scheduler (CFS) was tested for a short period of time in the real-time patch set before being moved to the mainline kernel.20 A full discussion of the design is outside the scope of this paper, but the most distinguishing feature is that instead of using run-queues to store pending tasks, the scheduler uses a time-ordered red-black tree (a self-balancing binary search tree).21
This is interesting because it logically follows from the design philosophy of high-resolution timers and dynticks: moving away from periodic ticks as the primary unit of time measure and instead using nanoseconds as the base unit. The scheduler is also somewhat modular. There can be different classes of tasks that can be scheduled, where the classes are prioritized and each may have its own scheduler logic. This allows for a real-time scheduler to be used for real-time processes without complicating the scheduler used for non-real-time tasks.
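The modular arrangement of scheduling classes can be sketched as follows. This is a toy model rather than the CFS implementation: the class and function names are hypothetical, a binary heap stands in for the red-black tree, and ordering fair tasks by the CPU time they have already received is an assumption made for illustration.

```python
import heapq

class RealTimeClass:
    """Fixed-priority class: always runs the highest-priority runnable task."""
    def __init__(self):
        self._heap = []                  # entries are (-priority, sequence, name)
        self._seq = 0

    def enqueue(self, name, priority):
        heapq.heappush(self._heap, (-priority, self._seq, name))
        self._seq += 1

    def pick_next(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

class FairClass:
    """Time-ordered class: runs the task with the smallest runtime so far."""
    def __init__(self):
        self._heap = []                  # entries are (runtime_ns, sequence, name)
        self._seq = 0

    def enqueue(self, name, runtime_ns):
        heapq.heappush(self._heap, (runtime_ns, self._seq, name))
        self._seq += 1

    def pick_next(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

def pick_next_task(classes):
    """Ask each scheduling class, in priority order, for a task to run."""
    for sched_class in classes:
        task = sched_class.pick_next()
        if task is not None:
            return task
    return None
```

Because the real-time class is consulted first, a runnable real-time task is always chosen ahead of any task in the fair class, and neither class needs to know how the other orders its tasks.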
While all of the features discussed here were the result of community collaboration, IBM provided some much-needed foundational work, as well as stress testing, debugging, and hardening to make the features enterprise-ready.
Timekeeping
While not as exciting as the many features previously presented, reworking the Linux timekeeping system laid critical underlying infrastructure for functionality such as hrtimers and dynticks. The timekeeping rework began to address problems enterprise customers had encountered with Linux. At times, it was observed that “time” skipped backward or ran too fast. To explain how this could occur, some background on the system is needed.
As mentioned in the subsection “High-resolution timers,” Linux had utilized a periodic tick for scheduling and timers. Another function of this periodic tick was to increment the internal time of the system by one tick length. Every one to ten milliseconds, a timer interrupt would fire, the internal time of the system would be incremented by one tick, some process accounting would go on, timers would be fired, and finally the scheduler would run. This worked fine in the general case; however, using this method alone for timekeeping resulted in very coarse-grained time (one to ten milliseconds).
As a result, the kernel would use a secondary high-resolution counter of some sort to measure fine-grained time between ticks. When the interrupt fired, the value of the secondary counter was stored as the timekeeping values were updated. Then, if an application called gettimeofday(), the secondary counter was read again, and the time elapsed since the last tick, calculated from the difference between the two readings, was added to the internal time of the system.
The problems with this design are subtle. One of the most common secondary counters used for timekeeping was the time stamp counter (TSC) on i386**-based systems. Most of the issues cropped up when processors started changing their frequencies to save power. This resulted in the TSC frequency changing as well. Since the design of the system required the two clocks, the periodic tick and the secondary counter, to be calibrated against each other, any change in frequency of either clock resulted in incorrect timekeeping. Secondly, adjustments made by the network time protocol (NTP) to keep the system clock in sync with network time were only applied to the tick-based internal time; thus, there could be small inconsistencies, where time might go slightly backward immediately after an interrupt.
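The calibration problem can be made concrete with a small numeric sketch of the tick-plus-offset scheme. The function names and all frequencies below are hypothetical; the model only shows how an offset interpolated with a stale calibration value can overshoot the next tick, so that time appears to step backward once that tick is accounted for.

```python
NSEC_PER_SEC = 1_000_000_000

def old_gettimeofday(ticks, tick_ns, counter_now, counter_at_tick, calibrated_hz):
    """Tick-based time plus an offset interpolated from a secondary counter."""
    offset_ns = (counter_now - counter_at_tick) * NSEC_PER_SEC // calibrated_hz
    return ticks * tick_ns + offset_ns

def counter_after(elapsed_ns, actual_hz):
    """Cycles the secondary counter advances in elapsed_ns at its actual frequency."""
    return elapsed_ns * actual_hz // NSEC_PER_SEC
```

Suppose the counter was calibrated at 2 GHz and the tick is one millisecond. A reading taken 0.9 ms after tick 5 yields 5.9 ms when the counter really runs at 2 GHz, but 6.35 ms if it is actually running at 3 GHz; the next tick then sets the time to 6.0 ms, which is earlier than the value just reported.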
These problems in such a critical subsystem could create serious difficulties for real-time applications. If system time might go backward or jump forward unexpectedly, how can one determine the right time to fire a critical timer, or if deadlines are truly being met?
To resolve these issues, the timekeeping subsystem was fully rewritten with assistance of Roman Zippel, Thomas Gleixner, and Ingo Molnar, along with other members of the community. Instead of using a method utilizing a dual clock tick with fine-grained offsets, the new subsystem uses only one trusted counter. This counter is abstracted using the clocksource structure. Clock sources allow only a small amount of architecture-specific code to be used with a generic timekeeping core to provide accurate time. Using clock sources, at every timer interrupt, we read how much time has passed using the counter and add that to the system time. When an application calls gettimeofday(), the interval since the last timer interrupt is simply calculated and added to the internal time of the system. This way, we do not have to deal with inconsistencies between multiple clocks, and adjustments made through NTP can be applied directly to the clock source being used. Another important benefit is that since there are no assumptions about “ticks” or how much time passes between interrupts, there is no need for a periodic timer tick. Whether a timer interrupt fires every 100 microseconds or every 100 milliseconds, the right amount of time is always added. This detail was crucial for the dynticks infrastructure to function properly.
Kernel hardening and developer usability
It would be unwise for any company to apply a 1.5 megabyte patch to a kernel and simply ship the results to a customer. Thus, before IBM would let customers use the real-time kernel, quite a bit of testing, debugging, and defect removal to harden the kernel was needed.
Some of the first issues encountered were developer-usability issues. We contributed fixes that re-enabled and improved the robustness of the functions profile and oprofile, to help developers analyze their systems. Further, real-time programming requires careful consideration, because the real-time kernel gives the programmer considerably more power. In developing stress tests and benchmarks, systems frequently crashed due to unrestrained real-time processes. In these cases, a process was frequently given a very high priority and, sometimes unintentionally, would run in a tight loop, over-utilizing the CPU. Since the real-time kernel allows much of the kernel to be preempted by high-priority processes, the unrestrained task would prevent the kernel from executing critical functionality.
We encountered this in a number of cases. In one case, the timekeeping accumulation code was reworked to execute as a low-priority timer. This was safe on most systems, but some systems have clock sources that overflow after just a few seconds. Thus, if the timekeeping accumulation was prevented from executing for even a few seconds, time would stop increasing and would wrap back. The timekeeping code had to be further refined to protect against clock source overflow while still allowing the majority of the timekeeping code to be deferred so it would not block real-time user processes from running.
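The single-counter design and the overflow hazard described above can be illustrated with a minimal sketch. It is not the kernel's code: the structure layouts, the function names, the integer nanoseconds-per-cycle scale, and the simulated 8-bit counter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the clocksource abstraction. */
struct clock_source {
    uint64_t (*read)(void);   /* returns the raw hardware counter value */
    uint64_t mask;            /* bit mask covering the counter's width */
    uint64_t ns_per_cycle;    /* assumed integer scale, for simplicity */
};

struct timekeeper {
    const struct clock_source *cs;
    uint64_t last_cycles;     /* counter value at the last accumulation */
    uint64_t system_ns;       /* accumulated system time in nanoseconds */
};

/* Simulated 8-bit counter (illustration only): it wraps every 256 cycles. */
uint64_t fake_cycles;
uint64_t fake_read(void) { return fake_cycles & 0xFF; }

void tk_init(struct timekeeper *tk, const struct clock_source *cs)
{
    assert(tk != NULL && cs != NULL && cs->read != NULL);
    tk->cs = cs;
    tk->last_cycles = cs->read() & cs->mask;
    tk->system_ns = 0;
}

/* Cycles elapsed since the last accumulation. The masked subtraction
 * stays correct across one counter wrap, but not across two. */
uint64_t tk_cycles_since_last(const struct timekeeper *tk)
{
    return (tk->cs->read() - tk->last_cycles) & tk->cs->mask;
}

/* Run from each timer interrupt, however far apart the interrupts are. */
void tk_accumulate(struct timekeeper *tk)
{
    uint64_t delta = tk_cycles_since_last(tk);

    tk->system_ns += delta * tk->cs->ns_per_cycle;
    tk->last_cycles = (tk->last_cycles + delta) & tk->cs->mask;
}

/* A gettimeofday()-style read: accumulated time plus the interval
 * since the last accumulation. */
uint64_t tk_now_ns(const struct timekeeper *tk)
{
    return tk->system_ns + tk_cycles_since_last(tk) * tk->cs->ns_per_cycle;
}
```

With this sketch, a single wrap of the counter between accumulations is handled correctly, but if accumulation is deferred for longer than one full counter period, the elapsed cycles are undercounted and time is silently lost. This is why the accumulation step cannot be postponed indefinitely behind real-time tasks.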
Another issue we encountered was with the SMP interrupt affinity code. When an interrupt is given an SMP affinity mask, the interrupt is usually programmed to be routed to just one of possibly many CPUs. Unfortunately, if a runaway task was over-utilizing one CPU, it was possible that important interrupt processing would be deferred, even if other CPUs were idle. This is problematic because if the interrupt is for the SCSI disk where the root file system is mounted, it could paralyze the system. Thus, we provided a fix to allow the interrupt processing to be scheduled on any of the CPUs in the affinity mask.
Another way we addressed the developer-usability issues surrounding unrestrained real-time tasks was to re-enable CPU run-time limit (rlimit) functionality, which relies on the posix-cpu-timers code (responsible for accounting for the time that processes run on the CPUs). One of the problems with the posix-cpu-timers code was that it was normally called from the timer interrupt context. Since the timer interrupt is the one interrupt not executed in a kernel thread, it requires that all locks taken be raw locks, as sleeping is not allowed in interrupt context. The posix-cpu-timers code unfortunately needs to acquire a large number of locks. Converting those locks to raw locks would increase latency, so posix-cpu-timers were simply disabled in early versions of the real-time kernel. We re-enabled the code by using a deferral method similar to that used by the interrupt threads, and created a posix-cpu-timers kernel thread that is scheduled to run at each timer interrupt. However, since this kernel thread could be deferred by high-priority real-time processes, we forced the kernel thread to run at the highest priority level, to enable the enforcement of CPU run-time limits against all but the highest-priority real-time tasks. This does have the potential to increase latency (and it is likely that the posix-cpu-timers will be reworked in the future), but it allows developers to enforce run-time limits and stop unrestrained real-time tasks without having to reboot the system.
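From the application side, a CPU run-time limit of this kind is requested through the standard setrlimit() interface. The following is a minimal sketch; the helper name limit_cpu_seconds is hypothetical and the limit values in the usage note are arbitrary examples, not values from our test environment.

```c
#include <errno.h>
#include <sys/resource.h>

/* Cap the CPU time of the calling process (hypothetical helper).
 * On Linux, once soft_secs seconds of CPU time have been consumed the
 * process receives SIGXCPU; at hard_secs it receives SIGKILL.
 * Returns 0 on success, or an errno value on failure. */
int limit_cpu_seconds(rlim_t soft_secs, rlim_t hard_secs)
{
    struct rlimit lim;

    if (soft_secs > hard_secs)
        return EINVAL;

    lim.rlim_cur = soft_secs;   /* soft limit: warning via SIGXCPU */
    lim.rlim_max = hard_secs;   /* hard limit: task is killed */

    return setrlimit(RLIMIT_CPU, &lim) == 0 ? 0 : errno;
}
```

A stress test might call limit_cpu_seconds(10, 15) before raising its own priority, so that a runaway loop is terminated by the kernel rather than requiring a reboot. Note that an unprivileged process can lower its hard limit but cannot raise it again.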
Finally, one last way to aid debugging was to introduce a modified version of the sshd init script that would start an ssh (secure shell) daemon running at the highest-possible real-time priority on a debug port. In many cases, this allowed developers to log in and narrow down which of their real-time processes had gone awry.
Testing of the real-time kernel is unique in one very important way. Typical performance numbers measure average throughput or completion time. Real-time applications with strict latency requirements are not concerned with how fast they run most of the time, only with how slowly they run in the worst case. To meet the unique needs of real-time systems, we have written several micro-benchmarks that test a specific function or latency path repeatedly and report basic statistical information about those results (see Reference 22). A small sampling of the results from two of these tests is included here to illustrate the improved determinism of the real-time Linux kernel over the mainline kernel. The following results were obtained from tests run on a four-way AMD Opteron** system with 8 gigabytes of memory. The results labeled as “Loaded” were achieved with a background load of a parallel Linux kernel build in a loop.
The time-dependent nature of real-time applications means that they make frequent use of the gettimeofday() system call. They require that it be both accurate and reliably fast, even under heavy system loads. The “gtod” (i.e., get time of day) latency micro-benchmark measures the elapsed time between pairs of gettimeofday() calls. As illustrated in Figure 2(A), the version 2.6.16 mainline kernel performs reasonably well under no load, with a maximum latency of only 33 microseconds and a standard deviation of 0.57 microseconds. Under load, however (see Figure 2(B)), the maximum latency of the mainline kernel increases to 162 microseconds, with a standard deviation of 0.59 microseconds. The real-time kernel, on the other hand, achieves a 14-microsecond maximum latency with a standard deviation of 0.48 microseconds under no load (see Figure 3(A)). While this is an improvement, the true test occurs under load, where the real-time kernel maintains a maximum latency of only 19 microseconds, with a standard deviation of only 0.45 microseconds (see Figure 3(B)).
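The measurement idea behind the gtod test can be sketched in a few lines of C. This is an illustrative reconstruction rather than the benchmark from our test suite; the names gtod_stats and gtod_latency are hypothetical, and the sketch reports the variance (whose square root is the standard deviation) to avoid depending on the math library.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Summary statistics for the measured intervals, in microseconds. */
struct gtod_stats {
    double min_us;
    double max_us;
    double mean_us;
    double var_us2;   /* variance; standard deviation is its square root */
};

/* Elapsed microseconds from *a to *b. */
static double tv_diff_us(const struct timeval *a, const struct timeval *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e6
         + (double)(b->tv_usec - a->tv_usec);
}

/* Call gettimeofday() in back-to-back pairs and record how much time
 * appears to pass between the two calls of each pair. */
struct gtod_stats gtod_latency(size_t iterations)
{
    struct gtod_stats s = { 0.0, 0.0, 0.0, 0.0 };
    double sum = 0.0, sum_sq = 0.0;
    size_t i;

    assert(iterations > 0);

    for (i = 0; i < iterations; i++) {
        struct timeval t0, t1;
        double d;

        gettimeofday(&t0, NULL);
        gettimeofday(&t1, NULL);
        d = tv_diff_us(&t0, &t1);

        if (i == 0 || d < s.min_us)
            s.min_us = d;
        if (i == 0 || d > s.max_us)
            s.max_us = d;
        sum += d;
        sum_sq += d * d;
    }

    s.mean_us = sum / (double)iterations;
    s.var_us2 = sum_sq / (double)iterations - s.mean_us * s.mean_us;
    if (s.var_us2 < 0.0)          /* guard against rounding error */
        s.var_us2 = 0.0;
    return s;
}
```

For a real-time workload the figure of interest is max_us rather than mean_us, since a single long interval represents a potential missed deadline.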
Figure 2
Figure 3
Another common task for real-time applications is the periodic scheduling of tasks. The tasks are expected to be awakened and run as close to their scheduled time as possible. The “sched” latency micro-benchmark measures the scheduling latency of a high-priority 5-millisecond periodic thread. Because the version 2.6.16 mainline kernel did not have the new high-resolution timers (see “High-resolution timers”), its timer resolution was only as good as three times the timer tick period, totaling about 3 milliseconds in this case. This makes periodic scheduling impractical, as evidenced by Figure 4. With the high-resolution timer code in place, however, the real-time kernel achieved an unloaded maximum scheduling latency of 14 microseconds, and a loaded maximum of 51 microseconds.
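A periodic task of this kind is commonly written as a loop that sleeps until an absolute deadline and then measures how late it woke up. The sketch below shows that pattern using the POSIX clock_nanosleep() call; it is an assumption about how such a test can be structured, not the code of our “sched” benchmark, and the function name max_wakeup_latency_ns is hypothetical. Raising the thread to a real-time priority is left to the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns nanoseconds, keeping tv_nsec normalized. */
static void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

/* Nanoseconds from *a to *b. */
static int64_t timespec_diff_ns(const struct timespec *a,
                                const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * NSEC_PER_SEC
         + (int64_t)(b->tv_nsec - a->tv_nsec);
}

/* Sleep until each successive absolute deadline and return the worst
 * observed wakeup latency in nanoseconds, or -1 on error. */
int64_t max_wakeup_latency_ns(long period_ns, int periods)
{
    struct timespec next, now;
    int64_t worst = 0;
    int i;

    assert(period_ns > 0 && period_ns < NSEC_PER_SEC && periods > 0);

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (i = 0; i < periods; i++) {
        int64_t late;
        int rc;

        /* Deadlines are absolute, so lateness does not accumulate. */
        timespec_add_ns(&next, period_ns);
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return -1;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        late = timespec_diff_ns(&next, &now);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling max_wakeup_latency_ns(5000000L, 1000) would approximate the 5-millisecond period used in the test described above. Without high-resolution timers, the wakeups can only occur on timer tick boundaries, which is what makes the mainline results in Figure 4 so coarse.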
Figure 4
The complete test suite covers various latency, functional, stress, and scalability tests. It has served well for release testing, and for detecting regressions in the primary community-maintained tree.
Providing the Linux kernel functionality was only the first step to making enterprise real-time Linux a success. An RTOS relinquishes some amount of control to the user space, providing the application developer with more control over the system as a whole. The real-time JVM abstracts much of this control away from the end developer, and certain precautionary default configurations can help reduce catastrophes.
IBM WRT for Java provides an abstraction layer to the POSIX** interfaces for accessing real-time capabilities. In addition, the real-time JVM alleviates the issues of unpredictable latency inherent in traditional JVMs. By way of an innovative incremental garbage collector, the real-time JVM guarantees that the execution of application code is never delayed due to garbage collection for more than a specified amount of time in a given period. The real-time JVM also abstracts system-level priorities to the application developer, leaving a range of priorities free for the JVM, kernel, and system threads to operate without having to worry about a heedless developer causing starvation.
With the additional kernel threads created by the PREEMPT RT patch, system configuration has become much more flexible, and more complex. By providing a mechanism to adjust the priority of these new threads at boot time, the system can be tailored to meet the needs of a wide range of applications. The default settings prioritize most interrupt handlers above the priority range of the JVM, ensuring a stable system. These settings can be fine-tuned over the course of application development to meet specific latency requirements.
In a typical Linux system, access to real-time scheduling capabilities is restricted to the root user. In an enterprise real-time environment, it is not acceptable to have every application with latency requirements running with full root privileges. A slight modification to the pluggable authentication modules (PAM) enables the administrator to grant real-time capabilities to users of a specific group, typically the real-time group. This provides the means to limit the privileges of real-time applications while still providing them with the necessary capabilities to access real-time functionality. Reference 23 provides a more thorough coverage of IBM WRT for Java.
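The effect of this privilege check is visible to an application as the return value of the scheduling call itself. The following sketch uses the standard sched_setscheduler() interface; the helper name request_fifo_priority is hypothetical, and the sketch makes no assumption about how the administrator granted (or withheld) the capability.

```c
#include <errno.h>
#include <sched.h>

/* Ask the kernel to run the calling process under SCHED_FIFO at the
 * given priority (hypothetical helper). Returns 0 on success, or an
 * errno value: EPERM means the caller lacks real-time privileges. */
int request_fifo_priority(int priority)
{
    struct sched_param sp;

    if (priority < sched_get_priority_min(SCHED_FIFO) ||
        priority > sched_get_priority_max(SCHED_FIFO))
        return EINVAL;

    sp.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0 ? 0 : errno;
}
```

An application run by a user outside the real-time group would typically see EPERM here and could fall back to normal scheduling or exit with a clear diagnostic, while a member of the group would obtain the requested priority without running as root.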
Enterprise real-time Linux is a radical development in a few critical ways. First, it is a case study in how open source techniques and leveraging work done by the open source development community can accelerate product development. Second, it demonstrates the merging of real-time technologies into enterprise systems and into a high-level language such as Java.
Bringing real-time Linux to market was a swift and challenging task. The use of open tools and a distributed development model mirroring that of the Linux community contributed to our team's ability to work quickly and effectively, while keeping them in sync with our community colleagues. Ingo Molnar's PREEMPT RT kernel patch, while experimental, provided a high-quality code base on which to base our product development. With a goal of stability and robustness, we were able to prepare the code base for customer deployment while giving back to the community through feature development, defect removal, new benchmarks, and exhaustive testing of key real-time features on a variety of enterprise systems.
Combined with the WRT product released by our colleagues at the JTC, we were able to deliver a real-time Java solution composed of IBM blade servers and IBM storage technologies, real-time Linux, and the real-time JVM. Our lead customer, Raytheon, is using these IBM technologies as part of the DDG 1000 Zumwalt-Class Destroyer program, the next-generation destroyer for the U.S. Navy.7 In addition, there are a number of other federal systems integrators as well as companies in the financial sector that are examining the technology and embarking on pilots and proofs of concept utilizing this real-time technology.
With real-time Linux products becoming available from the enterprise Linux distributions, we are making a transition to a more typical role, working with the distribution partners to incorporate features and fixes that are important to IBM and its customers. We are pleased to relinquish the role of release engineering to them and devote our efforts to improving the real-time kernel. During the coming year, we plan to focus our development efforts on improving scalability and reducing the performance costs associated with low latency and determinism.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of Sun Microsystems, Inc., Linus Torvalds, The Open Group, the Mozilla Foundation, OpenOffice.org, Object Management Group, Inc., the InfiniBand Trade Association, Intel Corp., Advanced Micro Devices, Inc., or the Institute of Electrical and Electronics Engineers in the United States, other countries, or both.
Accepted for publication November 15, 2007; Published online May 1, 2008.