
Next 5- Application: Computer Immune System
Previous 4.1- Characterizing the algorithm using random sequences
Up 4- How Good is the Algorithm?

4.2- The false-positive record

The algorithm has been used to extract most of the computer virus signatures used by IBM AntiVirus. Only a small handful of false positives have been reported. In most cases, the offending signatures have been those taken from a virus written in a high-level language such as C or Pascal. Such viruses tend to be even more of a problem for human experts than for the algorithm!

It is often difficult to extract decent signatures for such viruses because compilers tend to introduce a lot of boilerplate code that gets intermingled with the meat of the virus code, obscuring any idiosyncrasies that might be used to identify the virus. In other words, program individuality is largely washed out by compilers, making it intrinsically difficult to find a good signature. Usually, there are just a few pockets of unusual code, which can be very difficult for even the most expert of humans to find. It is hard to imagine that a human would want to be good at doing this, because it takes a lot of very specific knowledge about the machine code that is produced by each of the several dozen most commonly used compilers. But the algorithm is perfectly happy to become intimately acquainted with such statistical details, and for this reason it tends to be much better than humans at extracting signatures from compiled viruses. It is easy to tell when the algorithm is working on such a virus, because almost all of the candidate signatures have very high estimated false-positive probabilities. In almost every case, the algorithm locates the pockets that contain good signature material, and chooses a signature from one of them.

We are well on the way to understanding what went wrong in the very few cases where the algorithm selected signatures that later proved to yield false positives. We are very optimistic about one particular idea that we think will lead to substantial improvements in the algorithm's performance on compiled viruses (stay tuned!).

Figure 4: Histogram of estimated signature probabilities for Virus Bulletin signatures from 1991. The black histogram represents virus signatures responsible for one or more false positives.

Figure 5: Number of times that each of the six "bad" signatures of Fig. 4 was found in the corpus, using fuzzy matching criteria. Note that all of the bad signatures have log probabilities that are much higher than our chosen threshold. In other words, the automatic algorithm would not have come close to selecting any of these poor signatures.

Another way to evaluate the performance of the algorithm is to find an alternative source of virus signatures and then check how the false-positive rate correlates with the probability estimated by the algorithm. The Virus Bulletin is a very convenient source of signatures. As an experiment, we took a large set of signatures that had been published in Virus Bulletin over the course of several months during mid-1991, and used the algorithm to estimate their false-positive probabilities. Then, we incorporated the Virus Bulletin signatures (which were typically 16 bytes in length) into IBM's virus scanner.

The Virus Bulletin signatures were only intended to be used with exact matching. However, in order to encourage them to produce false positives, we turned on fuzzy matching, which declared a match if 12 or more of the 16 bytes matched, and scanned the corpus to see which signatures (if any) caused "false positives".
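As a rough illustration of the fuzzy matching criterion, here is a minimal sketch. It assumes that "matched" means byte-for-byte agreement at the same position within a window (mismatches only, with no insertions or deletions); the scanner's actual matching rules are not spelled out here, and the function names are hypothetical.

```python
def count_matching_bytes(signature: bytes, window: bytes) -> int:
    """Number of positions at which signature and window hold the same byte."""
    return sum(s == w for s, w in zip(signature, window))


def fuzzy_find(signature: bytes, data: bytes, min_matches: int = 12) -> list[int]:
    """Offsets in `data` where at least `min_matches` bytes of `signature`
    agree position by position with the window starting at that offset.

    Assumption: mismatches are substitutions only. With a 16-byte signature
    and min_matches=12, up to 4 mismatched bytes are tolerated.
    """
    n = len(signature)
    hits = []
    # Slide a window of the signature's length across the data.
    for offset in range(len(data) - n + 1):
        window = data[offset:offset + n]
        if count_matching_bytes(signature, window) >= min_matches:
            hits.append(offset)
    return hits
```

Setting `min_matches` to the signature length reduces this to exact matching, which is how the Virus Bulletin signatures were meant to be used.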

Out of 267 signatures, 6 yielded one or more false positives. As demonstrated in Fig. 4, the signatures that caused false positives were those for which the estimated probability was much greater than average. The signature with the highest estimated probability, Kamikaze, turned out to be the most notorious false-positive generator; it was found over 60 times in the corpus (see Fig. 5). In many cases, there were just 2 mismatched bytes. Jocker, which the algorithm claimed was one of the 5 worst signatures, turned out to be the second worst offender in the scanner test; it hit on about 10 files in the corpus.
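The comparison made here, between the algorithm's probability ranking and the signatures that actually hit in the corpus, could be sketched as below. The function names are hypothetical, and the sample values are made-up placeholders, not the data from this experiment.

```python
def rank_by_estimated_probability(log_probs: dict[str, float]) -> list[str]:
    """Signature names ordered from highest (riskiest) to lowest
    estimated log probability."""
    return sorted(log_probs, key=log_probs.get, reverse=True)


def offenders_in_top_k(log_probs: dict[str, float],
                       offenders: set[str],
                       k: int) -> list[str]:
    """Which of the k highest-probability signatures caused false positives."""
    ranked = rank_by_estimated_probability(log_probs)
    return [name for name in ranked[:k] if name in offenders]


# Made-up placeholder values for illustration only (not the paper's data).
example_log_probs = {"sig_a": -40.0, "sig_b": -12.5, "sig_c": -33.0, "sig_d": -9.0}
example_offenders = {"sig_d", "sig_c"}
top_two_offenders = offenders_in_top_k(example_log_probs, example_offenders, k=2)
```

If the estimates are informative, most of the signatures that produced false positives should appear near the top of this ranking.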

In short, the automatic algorithm did an excellent job of identifying the signatures that were most at risk of generating false positives.

