Skip to main content


next previous up

Next 4.2- The false-positive record
Previous 4- How Good is the Algorithm?
Up 4- How Good is the Algorithm?

4.1- Characterizing the algorithm using random sequences

Although using random sequences to characterize the relationship between estimated and actual probabilities sounds like a fine idea in principle, there is a hitch: a randomly generated byte sequence of any length is extremely unlikely to be found in any corpus. For example, there are tex2html_wrap_inline512 different sequences of length 16; the probability of any given one of them being found in a 1 gigabyte corpus would be about tex2html_wrap_inline514 .

The obvious solution is to reserve a section of the corpus, and choose from it ``random'' sequences that have a much better chance of having something in common with the sequences found in the rest of the corpus. In these experiments, the corpus (containing thousands of DOS, Windows, and OS/2 executable programs, comprising roughly half a gigabyte) is divided into three partitions: a small ``probe'' set, and training and test sets of roughly equal size. On the order of 10,000 sequences of s (where s is typically in the range of 12-24) contiguous bytes are drawn randomly from the probe set, and treated as candidate signatures. The probability estimation algorithm is run against the candidate signatures and the training corpus, while the number of instances of each candidate signature in the test corpus is tallied and divided by the size of the test corpus to obtain an ``actual'' probability for each signature. Then, the actual and estimated probability for each of the candidate signatures are plotted against one another on a logarthmic scale.

A typical result is displayed in Fig. 2. In this case, the candidate signature length was s=24, and no mismatches were permitted. Ideally, one would like all of the pairs of estimated and actual probabilities to fall on the dashed line, which represents perfect agreement between the two probabilities. Obviously, the algorithm falls far short of this ideal. The actual probabilities are almost always greater than the estimated probabilities because the correlations among bytes that are separated by several bytes -- which are assumed to be negligible in the approximation in Eq. 4) -- are in fact strongly positive.

  

figure125

Figure: Estimated vs. actual exact-match probabilities for 10,535 24-byte sequences selected randomly from the probe set. Dashed line indicates equality between estimated and actual probabilities. For the several thousand probe sequences which never appeared in the test corpus, the logarithm of the actual probability is tex2html_wrap_inline522 . The estimated log-probabilities for these ``good'' sequences varied from approximately -165 to -18, resulting in a nearly-continuous vertical line at the left-hand side of the figure. The vertical striations to the right of it correspond to sequences which appeared once, twice, etc. in the test corpus. The estimated log-probabilities for these ``bad'' sequences also varied over a considerable range.

At first glance, Fig. 2 seems to contain nothing but bad news. However, the ultimate goal is not accurate probability estimation; it is simply to discriminate good signatures from bad ones. In fact, the results of Figure 2 show that this is quite feasible! Consider the vertical stripe at the left-hand side of Fig. 2. It corresponds to sequences which never appeared in the test corpus. Thus, we shall refer to them as ``good'' sequences. The cloud of points to the right of it corresponds to sequences which appeared one or more times in the reduced corpus. These we refer to as the ``bad'' sequences. In using the automatic extraction/evaluation algorithm, we must select a threshold for the estimated log-probability such that it excludes all or practically all bad sequences without eliminating too many good sequences. In this case, setting the estimated log-probability threshold at approximately -70 excludes practically all of the bad sequences, while apparently there are still many good sequences which would not be excluded. However, in order to tell exactly what proportion of good sequences are not excluded, we must look at the data in a different way.

In order to select a reasonable log-probability threshold T, we first need to compute two quantities as a function of T: the number of good sequences which are accepted by the threshold T, tex2html_wrap_inline530 , and the number of bad sequences which are accepted by T, tex2html_wrap_inline534 . Note that tex2html_wrap_inline536 and tex2html_wrap_inline538 represent the total number of good and bad sequences, respectively. Then, in order not to reject too many good sequences, we want to minimize the false-rejection probability tex2html_wrap_inline540 , which can be done by making T as large (close to 0) as possible.

On the other hand, we also want to minimize the false-positive probability tex2html_wrap_inline544 , which requires that T be made as low as possible. Figure 2 displays the false-rejection and false-positive probabilities derived from the data presented in Fig. 2.

  

figure143

Figure: False-rejection and false-positive probabilities as a function of log-probability threshold T, derived from the data presented in Fig. 2.

Now we are in a position to make a well-informed tradeoff between false rejection and false positives. In this example, the false-positive probability is zero for tex2html_wrap_inline550 , at which point the false-rejection probability is 48%. We can lower the false-rejection probability further by choosing T;SPMgt;-68, but only if we are willing to accept the chance of false positives. This can be seen clearly from Fig. 2. However, if we are willing to tolerate a false-positive probability of 0.5%, we can reduce the false-rejection probability to 36% by increasing T to -60.

What is a reasonable tradeoff? We think it is important to keep the probability of a false positive occurring in a collection of software distinct from (but similar in size to) our half-gigabyte corpus to at most a few tenths of one percent. To help illustrate what this means, suppose that we have a large collection of software containing no duplicates. In order for there to be a reasonable chance of obtaining a false positive, the size of the collection would have to be at least several hundred gigabytes. This may be on the order of the total amount of software in common usage in today's world.

The flip side of the tradeoff is the false-rejection probability. If we are extracting signatures from virus code, we can often live with a reasonably high false-rejection probability; certainly more than 50%, and perhaps even 80% or 90%. Viruses typically contain at least several hundreds or thousands of byte sequences from which to choose. We only need one or perhaps a few signatures, so we can afford to throw away many good ones. However, there are situations in which we don't have the luxury of several dozen signatures from which to choose. For example, for self-garbling viruses we must choose the signature from the head, which can be fairly short. In this case, we might be forced to raise the threshold and accept a higher false-positive probability. In the case of signature evaluation, the signature is presumably one that has been recommended by an expert. It ought to be judged on the same basis as the extracted signatures. If the signature is rejected by the threshold, the proper course of action is to obtain a sample of the virus and extract the signature automatically.

In the case of Fig. 3, one need not struggle much to find a reasonable compromise. A choice of T somewhere in the range between -60 and -70 would satisfy both of our criteria quite well.

It is interesting to note that, in Fig. 3, the false-positive probability is quite high when T=0 -- approximately 34%. In other words, out of a corpus of approximately half a gigabyte, if one chooses at random a 24-byte sequence, there is a 34% chance of finding that same 24-byte sequence somewhere else in the corpus. The moral of this is that sheer length offers little protection from the risk that a signature will generate false positives. Cleverness, be it human or algorithmic, is an essential ingredient in choosing good computer virus signatures.

We emphasize that the term ``false-positive'' probability has been used above to mean the probability of a byte sequence being found in a body of programs both statistically similar to and comparable in size to our corpus. To allow for the fact that the number of programs that exist or could exist in the world exceeds the number of programs in the corpus by a considerable margin, it might be prudent to diminish the threshold probability by a factor of 10 or 100. In other words, perhaps the log-probability threshold T should be reduced by 4 or 5. However, this may be overly conservative because a majority of virus code is written in assembler, not the high-level languages in which most of today's PC software applications are being written. Thus selection of thresholds based upon studies of probes taken from the corpus itself is likely to be overly pessimistic for viruses, which are somewhat atypical software. Practical experience indicates that, very roughly, the two effects cancel one another.


next previous up

Next 4.2- The false-positive record
Previous 4- How Good is the Algorithm?
Up 4- How Good is the Algorithm?


Back To Index