Skip to main content


next previous up

Next 3- The Extraction/Evaluation Algorithm
Previous 1- Introduction
Up Automatic Extraction of Computer Virus Signatures

2- Virus Scanners and Signatures

For simplicity, let us first consider file-infecting viruses that infect their host programs with exact or near-exact copies of themselves. Self-garbling viruses, particularly those which are polymorphic, have naturally received most of the hype and hoopla, but a majority of PC DOS viruses -- even those found to in actual incidents -- are of this ``simple'' type.

Suppose that we wish to determine whether a particular host program P is infected with a particular simple virus V. The most obvious method would be to scan the machine code of P, looking for a pattern of bytes that exactly matched V. However, there are several practical problems with this approach. Typical computer viruses are a few hundred to a few thousand bytes in length. Given that there are several thousand PC DOS viruses (and thus several thousand patterns to be matched), the amount of memory required just to contain all of the patterns would be several megabytes, which would be prohibitive. Second, it would be dangerous for anti-virus software to contain a large library of known viruses. Virus writers would no doubt be very grateful if we were to make virus collections so easily accessible to them.

Rather than requiring an exact match, the typical practice is to use just a small piece of the virus code as a means for identification. These short templates, called signatures, are much easier to handle, and reveal nothing useful to virus authors. There is an additional, very important advantage to using short signatures: they still work even when other parts of the virus change.

There are two sources of viral mutation. First, viruses are sometimes modified deliberately by humans who wish to produce new viral strains without having to take the trouble to write one from scratch. However, the new strain will still be detected unless the change is made somewhere in the sequence of bytes corresponding to the virus signature.gif

The second source of viral mutation is programmatic. A small but growing minority of viruses are programmed to modify their form deliberately whenever they replicate in an attempt to evade detection by virus scanners. Generally, this is done by garbling the main portion of the virus using a key that is selected randomly at the time of replication. When the transmuted virus is loaded into memory and executed, the first several bytes of code (the ``head'') degarbles the main portion of the virus, which is then executed. However, the ``head'' is often a fixed sequence of bytes, and a signature can be selected it. A small portion of today's viruses are able to overcome this deficiency by randomly generating a large number of different heads, which may or may not possess the same functionality. These polymorphic viruses present more of a challenge to anti-virus technology. So far, human experts have been able to devise detection algorithms for all such viruses; the algorithms tend to employ methods that are much more complex than simple signature scanning.

Although short signatures allow one to capture much possible variation, there is a significant drawback to their use. The shorter the signature, the more likely it is to be found in some perfectly legitimate code just by coincidence. To minimize this possibility, experts choose sequences of bytes which look unusual to them. This tedious technique works adequately for the most part. However, it is not uncommon for a new release of a virus scanner to be followed quickly by numerous reports of false positives generated by one of the new signatures. Some in the anti-virus industry feel that false positives are a worse problem than viruses themselves because they create panic, confusion, and unnecessary work, and undermine people's faith in anti-virus technology. No matter what the signature, one can not guarantee that false positives will never occur, but it is imperative that the false-positive probability be reduced to acceptable levels.

Human-selected signatures for viruses that have been written in high-level languages are particularly prone to false positives because most of the generated code is taken verbatim from libraries specific to the compiler. Unless the expert spends an unconscionable amount of time familiarizing himself with the boring details of the machine code typically generated by every C, FORTRAN, Pascal, etc. compiler in the world, he will not be qualified to judge whether a particular sequence of bytes is really unusual. This is a job for a computer, not a human being!

In order to increase the likelihood of capturing new variations of a previously-known virus, some virus scanners (for example the one employed by IBM AntiVirus) recognize inexact matches between scanned bytes and signatures as a mutant strain of a known virus. The benefits of capturing a broader range of mutations must be weighed against the increased probability of a false positive. Again, human experts are not in a good position to make such a tradeoff because it is nearly impossible for them to quantify how much more risk is involved.

To summarize, a signature must satisfy two conflicting criteria. The signature (and its associated matching criterion) must capture a broad variety of conceivable mutations for a particular virus, but the false-positive probability must be as low as possible. Ultimately, the tradeoff must be based on human judgment, but that judgment must be founded on a reasonable estimate of the false-positive probability. The next two sections of this paper describe an algorithm for obtaining such an estimate.


next previous up

Next 3- The Extraction/Evaluation Algorithm
Previous 1- Introduction
Up Automatic Extraction of Computer Virus Signatures


Back To Index