IBM Systems Journal Papers accepted for publication

Continuously Available Systems   Preliminary abstract

Scoring and thresholding for availability

by S. Heisig and J. Hosking
As the capacity of hardware systems has grown and workloads have been consolidated, the volume of performance metrics and diagnostic data streams has outgrown people's ability to manage these systems using traditional methods. As work of different types (such as database, batch, and Web processing), each in its own monitoring silo, runs concurrently on a single image (operating system instance), both the complexity and the business consequences of a single image failure have increased. This paper presents two techniques for generating actionable information from the overwhelming amount of performance and diagnostic data available to human analysts. Failure scoring identifies high-risk failure events that may be obscured among the myriad system events. It replaces the human expertise otherwise needed to scan tens of thousands of records per day and produces a short, prioritized list for action by systems staff. Adaptive thresholding drives predictive and descriptive machine-learning-based modeling to isolate and identify misbehaving processes and transactions. The attraction of this technique is that it does not require human intervention and can be reapplied continually, resulting in models that are not brittle. Both techniques reduce the quantity and increase the relevance of data available for programmatic and human processes.
