The resiliency challenge presented by soft failure incidents
by J. M. Caffrey
A common problem observed by mainframe operators, and one that presents a significant challenge for resiliency and high availability, is the soft failure incident. In contrast with catastrophic failures, soft failures involve some degree of system shutdown that cannot be traced to a known component or subsystem malfunction. This has been described with the catch phrase “Systems don’t break; they just stop running, and we don’t know why.” Extending a medical paradigm, this paper proposes a new method for solutions deployed on z/OS™ to respond when either the system or the application stops running. The current approach is to treat the “disease” by determining the cause of the problem and taking action to prevent its recurrence. The new approach is to determine whether the system or application is behaving abnormally, identify the cause of this abnormal behavior, and take action to treat the “symptom.” This new approach uses machine learning and mathematical modeling to identify normal behavior, enabling the detection of abnormal behavior before it impacts the customer. Based on an analysis of critical problems and preliminary modeling work, the types of abnormal behavior identified are assigned to broad categories. In this paper, we describe the progress being made to address the challenge of soft failures by implementing this new paradigm.