3- Measuring Prevalence in a Given Environment
How do we go about answering the three major categories of questions posed in the last section (which we shall refer to as Q1 through Q3)? As was stated, we are limited to sampling some chosen subset of computer users, typically within a particular type of environment to which we have access.
One approach, which was taken by Certus in 1990 and 1991 and by Dataquest in 1991, is to survey a large number of organizations by contacting the person within each who is most responsible for troubleshooting virus problems. Based upon their surveys, Certus and Dataquest drew conclusions about some aspects of the trends of particular viruses over time and several details of virus incidents (Q1.B and Q2, respectively). Although these surveys do shed some light on the virus problem, many of the results are suspect because they rely on the accuracy of people's recollection. In some cases, respondents were asked to recall events which happened up to two years in the past, and it is not clear how many of them kept accurate records of virus incidents. Underreporting of old virus incidents would not be surprising under such circumstances.
We feel that a much more reliable way to answer these questions is to collect statistics on virus incidents directly from a large chosen population as they occur. For each incident, we must record (at a minimum) where and when the incident occurred, what virus was involved, and how many machines were affected. Other details of virus incidents (e.g. other sub-categories of Q2) would be useful to record as well. This method requires a population with three important characteristics:
We are still left with the other questions which were posed in the previous section. The question of how many different viruses exist in the world (Q1.A) has been debated hotly by several computer virus collectors and pundits. Continuance of the debate is nurtured by several factors:
We are content to let people enjoy their debate. Eventually, some good may come of it. However, we feel that it is much more important to know which few of the untold hordes of viruses are worth worrying about (Q1.B).
The set of questions regarding user behavior (Q3) is important, but cannot be answered by monitoring virus incidents. A user survey could be useful, although its reliance on people's ability to quantify their own behavior accurately could introduce a substantial amount of error [1]. Software tools which monitor user behavior in certain limited environments might be very useful supplements to such a survey.
Some additional remarks apply to all of the questions posed in the previous section. Whether the information is obtained via surveys or by closely monitoring a large population of users, care must be taken in gathering and interpreting the data. We must define the quantities we are trying to measure carefully, and make sure that the data we gather actually measure those quantities and are accurate. Even if the data are accurate, there are many pitfalls that must be avoided in their interpretation. For example, both the Certus and Dataquest studies have blurred the distinction between the number of incidents and the number of infected machines, and have in some cases failed to distinguish between the number of infections by one particular virus and the number of infections due to all known viruses. As we will see, attempting to extrapolate from noisy or incomplete data is another such pitfall.
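The distinction between incidents and infected machines, and between one virus and all viruses, can be made concrete with a small sketch. The Python below is only an illustration under our own assumptions: the `Incident` record, its field names, and the `summarize` helper are hypothetical, and the sample records are invented for the example rather than taken from any study. It keeps the four quantities separate so that none can be mistaken for another.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    """One virus incident, holding the minimum fields named in the text.

    The class and field names are hypothetical, chosen for illustration.
    """
    site: str        # where the incident occurred
    when: date       # when the incident occurred
    virus: str       # which virus was involved
    machines: int    # how many machines were affected


def summarize(incidents):
    """Tally incidents and affected machines separately, per virus and overall."""
    incidents_by_virus = Counter()
    machines_by_virus = Counter()
    for inc in incidents:
        incidents_by_virus[inc.virus] += 1             # one incident, whatever its size
        machines_by_virus[inc.virus] += inc.machines   # machines affected in that incident
    return {
        "incidents_by_virus": dict(incidents_by_virus),
        "machines_by_virus": dict(machines_by_virus),
        "total_incidents": sum(incidents_by_virus.values()),
        "total_machines": sum(machines_by_virus.values()),
    }


# Invented records for illustration only; not data from any survey or study.
sample = [
    Incident("site-a", date(1991, 3, 4), "virus-x", 1),
    Incident("site-b", date(1991, 3, 9), "virus-x", 12),
    Incident("site-a", date(1991, 4, 1), "virus-y", 2),
]
summary = summarize(sample)
```

With these invented records, the per-virus incident counts and the per-virus machine counts differ substantially, which is exactly the kind of gap that is lost when the two measures are reported interchangeably.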