Automatic Extraction of Computer Virus Signatures
Jeffrey O. Kephart and William C. Arnold
In Proceedings of the 4th Virus Bulletin International Conference, R. Ford, ed., Virus Bulletin Ltd., Abingdon, England, 1994, pp. 178-184
Abstract:
One way that anti-virus programs identify the presence of a virus
in an executable file, a boot record, or memory is by using short
identifiers called signatures, which consist of sequences of bytes in
the machine code of the virus. A good signature is one that is
found in every object infected by the virus, but is unlikely
to be found if the virus is not present; i.e. the likelihood
of both false negatives and false positives must be minimized.
Typically, a human expert chooses a signature for a new virus by means of a
laborious, time-consuming procedure. Unfortunately, the accelerating
influx of new computer viruses threatens to outpace the ability of
human experts to analyze and find signatures for them.
To help alleviate this burden, we have developed a statistical method
for automatically extracting good signatures from the machine code of a
virus. The basic idea is to characterize statistically a large corpus
of programs (currently about half a gigabyte), and then to use this
information to estimate false-positive probabilities for proposed virus
signatures. In effect, the algorithm extrapolates from the corpus to
the much larger universe of executable programs that exist or might exist.
In practice, signatures extracted by this method are very unlikely to
generate false positives, even when the scanner that employs them permits
some mismatches.
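
The idea of scoring candidate signatures against corpus statistics can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the statistical model, so the sketch substitutes a simple smoothed byte-bigram model, and the names `train_bigram_counts`, `log_prob`, and `best_signature` are hypothetical. It does not model scanners that permit mismatches.

```python
import math
from collections import Counter


def train_bigram_counts(corpus):
    """Count single bytes and adjacent byte pairs over a corpus of programs.

    `corpus` is an iterable of bytes objects (assumed stand-in for the
    large corpus of legitimate programs mentioned in the abstract).
    """
    unigrams, bigrams = Counter(), Counter()
    for program in corpus:
        unigrams.update(program)                    # counts of each byte value
        bigrams.update(zip(program, program[1:]))   # counts of adjacent pairs
    return unigrams, bigrams


def log_prob(seq, unigrams, bigrams, alpha=1.0):
    """Estimate the log-probability that `seq` appears at a given position.

    Uses a first-order (bigram) chain with additive smoothing `alpha`, so
    byte sequences never seen in the corpus still get a nonzero estimate.
    This is an illustrative model, not the one used in the paper.
    """
    total = sum(unigrams.values())
    lp = math.log((unigrams[seq[0]] + alpha) / (total + 256 * alpha))
    for a, b in zip(seq, seq[1:]):
        lp += math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + 256 * alpha))
    return lp


def best_signature(virus_code, length, unigrams, bigrams):
    """Return the `length`-byte window of `virus_code` with the lowest
    estimated probability of occurring in ordinary programs."""
    candidates = (
        virus_code[i:i + length]
        for i in range(len(virus_code) - length + 1)
    )
    return min(candidates, key=lambda s: log_prob(s, unigrams, bigrams))
```

In this sketch, a lower estimated probability stands in for a lower expected false-positive rate: windows built from byte patterns that are common in the corpus score high and are rejected, while windows that look unlike ordinary code are preferred.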
This patent-pending technique has been used either to extract or to
evaluate the more than 2500 virus signatures used by IBM AntiVirus.
It obviates the need for a small army of virus analysts,
permitting IBM's signature database to be maintained by a single virus
expert working half-time.