3.2- Choosing the best signatureBefore describing the selection process, it is worth noting two things. First, the aforementioned procedure for generating the candidate signatures is not crucial. Any other technique, including manual labor by a human expert, could be used. Second, the number of candidate signatures could be very small -- even just one. This would be appropriate if one were using the signature extractor to evaluate a signature that has been chosen previously, most likely by a human expert.
The key idea behind the algorithm is to estimate the probability that
each of the candidate signatures will match a randomly-chosen block of
bytes in the machine code of a randomly-chosen program, either exactly
or with some specified number or pattern of mismatches.
In the case of extraction, we select one or more signatures with the
lowest estimated ``false-positive'' probabilities of all the candidates,
making sure that this probability is less than some established threshold.
In the case of evaluation,
we just place a seal of approval on a signature if its estimated
false-positive probability is less than the established threshold.
Thus the problem of signature extraction or evaluation is reduced to
the following problem:
For a given sequence of S bytes
The most obvious way to compute
where
However, there is a serious
problem with this technique. Suppose that the corpus is
reasonably large -- on the order of a gigabyte or so.
For relatively common sequences (ones that could be expected
to appear several times per gigabyte), the probability
estimate given by Eq. 1
would be reasonably accurate. Somewhat common
sequences (ones that could be expected to appear once
or twice per gigabyte, or once in every few gigabytes)
might or might not appear in the corpus. Extremely
rare sequences almost certainly would not appear in
the corpus. It is readily apparent from these
considerations that the technique prescribed in
Eq. 1 has very little ability
to discriminate between somewhat common and very
uncommon sequences. Another way to express this is that
the dynamic range of possible probability estimates
yielded by this method is inadequate; it cannot produce
estimated probabilities less than
A more illustrative way to understand the inadequacy of this technique is by making an analogy to a problem encountered in training machines to understand human speech. In order to make sense of a stream of phonemes that have been extracted from an utterance, it is often useful to have a language model that is derived from a large corpus of utterances. The problem is similar to that of estimating the probability that a given sentence will be uttered, given a large corpus of previous utterances. For example, suppose that we have access to a recording of all press interviews with United States senators during the 1980's, and that we would like to estimate the probability for each of the following three sentences to be spoken by a U.S. Senator sometime during the 1990's:
1. God bless America.
2. It's all true -- we are space aliens.
3. We bless all true American space god aliens. The 1980's corpus would probably consist of tens or hundreds of millions of sentences. Within that corpus, we might expect sentence 1 to appear several times. It would be somewhat surprising to find an instance of sentence 2 in the corpus, but not absolutely inconceivable. Sentence 3 seems extremely unlikely. It is not the sort of sentence that one would expect to occur naturally in ordinary human discourse; it has the look of something generated by a somewhat incompetent computer program. The most likely scenario is that one would find several instances of sentence 1 in the 1980's corpus, but no instances of sentences 2 and 3. Under these circumstances, the method of Eq. 1 would assign an estimated probability of zero to sentences 2 and 3. It would fail to distinguish sentence 2 as being unusual, but significantly more likely than sentence 3. In fact, less than halfway through the 1990's, we find that sentence 1 has indeed been uttered several times. But we also find that sentence 2 has been uttered at least once -- by Senator Phil Gramm of Texas, who made his stunning confession on June 7, 1994, after the question of his extraterrestial origin was posed by a reporter from The Weekly World News [2]. As far as we are aware, the world is still waiting for sentence 3. There is a critical need for an algorithm capable of distinguishing the likelihood of sentence 2 as much greater than that of sentence 3, even if neither one appears in the corpus. Fortunately, researchers in the field of speech recognition have been dealing with just this sort of problem for many years. The solution is to collect statistics on short sequences of adjacent words. Trigrams (sequences of three adjacent phonemes or words) are fairly popular in the speech community. The key insight is that reasonably good statistics can be obtained for sufficiently short sequences. Then, a simple approximation formula can be used estimate the probability of a long sequence by combining the measured frequencies of the shorter sequences from which it is composed. In sentence 2, the sequences ``It's all true'' and ``we are'' are quite common, and ``space aliens'' is only somewhat unusual, so the overall sentence can be classified as unusual, but not terrifically so. In sentence 3, none of the individual words are very unusual, but the sequences ``American space god'', and ``space god aliens'' are quite unusual, and contribute to a very low overall estimated probability for the sentence. In effect, the trigram technique (which is easily generalized to higher-order n-grams) allows one to extrapolate beyond an existing corpus to a vastly larger universe of statistically similar utterances. This permits one to discriminate between somewhat unusual and extremely unusual utterances.
The technique that we have developed for automatic
signature extraction is completely analogous to
this standard speech recognition technique. Suppose
that trigram statistics were tallied for a large
corpus. Then the estimated probability for the
sequence
where
To get some insight into the derivation of Eq. 2,
consider the simpler problem of estimating the probability of
a 4-byte sequence
where p(A|B) is to be interpreted as the probability
of byte sequence A having been preceded by byte sequence B.
Eq. 3 is exact up to this point. However,
if we suppose the correlation between byte
Inserting Eq. 4 into Eq. 3 yields
a special case of Eq. 2.
The extension of Eq. 2
to higher-order n-grams is conceptually trivial.
The technique used to extract and evaluate
IBM AntiVirus's signatures
is a more sophisticated variant of
Eq. 2 that incorporates
measured frequencies of 1-, 2-, 3-, 4-, and 5-grams.
A few additional tricks are used to solve
the problem of adequate storage for the
3-, 4-, and 5-gram frequencies
The more sophisticated variant of Eq. 2
has some additional useful capabilities. Signatures with fixed
wildcards are handled by letting the wildcards serve as
demarcations between non-wildcarded regions. The estimated
probabilities of all non-wildcarded regions are multiplied to obtain the
overall estimated probability of the signature. To calculate
estimated probabilities of signatures for which m mismatches
are permitted, one can (conceptually) generate all of the
Note that, in the speech recognition example, the trigram technique achieved good discrimination between the likelihood of sentences 2 and 3 without having any real knowledge of grammar and semantics. Similarly, the automatic signature extraction technique makes no attempt to understand code in the same way that a human expert in machine code/assembly language would. It does not know what instructions mean; in fact, it does not even know that instructions exist. It sees machine code as nothing more than a long sequence of bytes, and it applies a blind, brute-force force statistical technique to those bytes. Remarkably, this lack of understanding is more than offset by the algorithm's ability to acquire a knowledge of machine code statistics to a depth to which no human expert could or would hope to aspire. This allows the algorithm to perform extremely well, as will now be seen.
|