IBM®
Skip to main content
    Country/region [change]    Terms of use
 
 
 
    Home    Products    Services & solutions    Support & downloads    My account    

IBM Systems Journal

Information-Based Medicine   Volume 46, Number 1, 2007
Table of contents: HTMLPDF This article: HTML PDFDOI: 10.1147/sj.461.0135Copyright info

A graph-theoretical approach for pattern discovery in epidemiological research

by R. A. Mushlin,
A. Kershenbaum,
S. T. Gallagher,
and T. R. Rebbeck

In this paper we describe a graph-theoretical approach for pattern discovery that is especially useful in epidemiological research when applied to case-control studies involving categorical features such as genotypes and exposures. Whereas existing approaches are limited to exploring relationships among two or three factors, or deal with thousands of genes but are unable to isolate interactions among individual genes, we focus on interactions among tens of genes. We present a pattern discovery algorithm that finds associations among multiple factors, such as genetic and environmental factors, and groups of individuals (cases and controls) in a clinical survey. To validate our approach and to demonstrate its effectiveness, we applied it to a selection of synthetic data sets that were devised to emulate the situations encountered in epidemiological studies involving common diseases with suspected associations involving multiple factors that could include inherited genotypes, somatic genotypes, demographic characteristics, or exposures. The results of this experiment show that it is possible to identify the effects of multiple factors in moderate-size surveys (involving hundreds of individuals) even when the number of factors is greater than three.

Introduction

One of the key promises of genomic medicine is the ability to predict susceptibility to complex diseases based on knowledge of inherited genotypes, somatic genetic changes, and environmental exposures. A great deal of effort has been invested in identifying the role of genes, exposures, lifestyles, and other factors in causing certain individuals to develop diseases or to exhibit poor prognoses when diagnosed. The problem is complicated by the fact that different combinations of genotypes and exposures can lead to the same disease, but may result in different levels of response to treatment or toxicity to drugs. Predicting disease risk and drug response has traditionally been the work of epidemiologists and pharmacologists. As genes have been found to play a major role in disease etiology and drug response, the fields of molecular epidemiology and pharmacogenomics have assumed much of the burden of studying these effects in detail.

When multiple factors combine to determine a person's risk of disease or response to treatment, it is often difficult to sort out the contributions of these various factors, and to identify the combinations of these factors that are relevant to disease etiology, outcome, or drug response. Tools are needed that can efficiently search the high-dimensional feature space and discover patterns associated with a disease etiology. Standard statistical approaches have traditionally dealt only with interactions among two or three factors; new approaches are needed to deal with higher-order interactions.

In this paper we describe a pattern discovery and analysis method based on modeling the risk factors, the individuals, and the discovered patterns as graph constructs, without reference to any underlying functional (biological) model. The method itself consists of four phases:

  1. Construct a graph that represents the risk factors associated with each individual.

  2. Find patterns in the graph that correspond to groups of individuals with identical risk factors, and quantify the risk and significance for each pattern.

  3. Construct a lattice that represents the relationships among the patterns.

  4. Enumerate the interesting and significant risk factors and subpopulations.

Once the risk factor combinations and their affected populations have been identified, domain experts can compare these associations to the predictions derived from functional and etiological models, thereby strengthening or weakening the evidence for a particular model.

The complex interactions of multiple factors in disease etiology, outcome, or drug response are difficult to detect. Often the order of the interaction is high, and the main effects of each of these factors individually may be weak. A number of methods have been proposed to evaluate higher-order interactions among genes and other risk factors, including recursive partitioning,1,2 random forests,3 combinatorial partitioning,4 permutation-based procedures,5 multivariate feature selection,6 multivariate adaptive regression splines,7 boosting,8 support vector machines,9 neural networks,10,11 Detection of Informative Combined Effects (DICE),12 logistic regression,13 penalized logistic regression,14 Bayesian pathway modeling approaches,15,16 Focused Interaction Testing Framework (FITF),17 consensus algorithms,18 and Classification and Regression Trees (CART). Another approach, multifactor-dimensionality reduction,19 has been recently shown to be a special case of CART.20 In particular, CART models have been widely applied and have the ability to detect complex interactions among multiple etiological factors. However, this method may assume an underlying model of association, may require assumptions about the identification of “purity” in the groupings identified, or may miss interactions that are not consistent with early splitting patterns. Our approach allows the detection of complex interactions among multiple etiological factors without making such assumptions.

The use of Bayesian graphical models to identify candidate genes in genome-wide association studies has recently been described.21 Efficient algorithms for discovering association rules among features in very large databases have long been used commercially for market basket analysis,22 but practical considerations limit the complexity of the discovered rules to a modest number of features. Our method, conversely, is aimed at a reduced set of already identified candidate gene polymorphisms. These polymorphisms may act in complex combinations to affect disease risk. Our method can handle this complexity and can shed light on the chemical pathway changes induced by combinations of polymorphisms.

The rest of the paper is organized as follows. We begin by describing our overall approach. We then give a detailed description of the implementation of our algorithm. Next we present computational evidence validating our approach. Finally, we summarize our contributions and suggest areas for future research.

Our approach

This section defines the basic concepts upon which our procedure is based. These include graph-theoretic concepts, epidemiological concepts, and set-theoretic concepts. We also describe the set-theoretic operations which form the basis for our algorithm.

Concepts and definitions

We use graphs to capture the relationship between a set of individuals and the allowed values of a specified set of features. The individuals and the feature values together make up the nodes of the graph. We distinguish between these two types of nodes by treating the feature values as source nodes (s-nodes), and the individuals as terminal nodes (t-nodes). We connect an s-node to a t-node with an edge if that feature value is exhibited by that individual. This results in a bipartite graph, a graph in which every node is one of two types (in our case, either an s-node or a t-node). Moreover, edges exist only between nodes of different types, never between nodes of the same type (Figure 1). Such a bipartite graph can be built to represent all or part of the data in the study. Hereafter, we refer to a bipartite graph as a graph, for simplicity.

Figure 1 Figure 1

A subgraph that consists of a set of nodes with edges between all pairs of nodes in the set is called a clique. Bipartite graphs cannot have (nontrivial) cliques because there can be no edges between any pair of nodes of the same type. There is, however, an analogous concept, called a biclique. A biclique is a subgraph defined by two sets of nodes where there is an edge between every node in the first set and every node in the second set. A maximal biclique is a biclique that is not contained in any larger biclique in the parent graph.

Figure 1 depicts a bipartite graph with t-nodes {pi} representing people and with s-nodes {fi} representing features. The node sets {p4, p6, p7}, {f3, f6, f8} and all the edges between them define a maximal biclique within the larger graph. We are interested in maximal bicliques because in our application they represent the largest set of people who share a common set of features. When applied to genotype association studies, each feature fi is one genotype (e.g., “AA”) for one polymorphic locus (e.g., “GENE1”). Thus, a maximal biclique containing a set of specific genotypes for multiple loci would also contain all the individuals who share that exact combination of genotypes for those loci. For simplicity, in the remainder of this paper, we use “clique” to mean “biclique”.

As the number of features increases, the number of people who share those features decreases. Each set of features generates a maximal clique. Maximal cliques and the relationships between them can be viewed as a lattice. A lattice consists of a set and a partial ordering (the “less-than” relationship, “<”) such that for each pair of elements in the set, x and y, there are four possibilities: (1) x < y, (2) y < x, (3) x and y are equal, and (4) x and y are unrelated.

The “<” relation is transitive; that is, if x is < y and y is < z, then x is < z. It is also anti-symmetric; that is, if x is strictly < y, then y cannot be strictly < x. In our case, we define a lattice on the cliques. In particular, the cliques are associated with sets of people and sets of features, and we define a notion of “<” in terms of subset relations on these sets. This will be described in detail in a later section.

In describing our method we make repeated use of constructs from both epidemiology and graph theory. Cases and controls are the two values of a binary classification variable used in an association study that define the dependent variable in the analysis. For example, cases can be those affected with a disease, those that have an adverse outcome in a longitudinal follow-up study, or those that have an adverse reaction to a drug in a pharmacogenetics study. Typically, a study is trying to determine if some exposure confers a risk of being a case. The exposure is the independent variable and can include inherited genotypes, somatic genotypes, chemical exposures, demographic characteristics, or any other risk factor of interest. In our application, we extend the notion of exposure to mean having a particular set of values for a specific group of features under study. Thus we shift from considering the individual features as independent risk factors to viewing a pattern of features as a single risk factor. This pattern may summarize information from a large number of independent variables. Here, we limit our discussion to binary independent and dependent covariates, but our approach can be extended to include polytomous variables (variables that take values from a discrete set) without loss of generality.

Risk measures

To quantify a pattern for the independent variables, we use a 2 × 2 table with meanings assigned to the rows and columns as shown in Table 1. The values of a, b, c, and d are counts of individuals having the indicated pattern of exposure and affection (case/control) status. The value of a is referred to as the support for the pattern. Using this table framework, many metrics can be derived to make inferences about the relationship of the dependent and independent variables. For assessing risk in case-control studies, the odds ratio (OR) is commonly used:

OR = (a · d)/(b · c).

The odds ratio can range from 0 to ∞. For OR > 1, we infer that the pattern confers risk; for OR < 1, we infer that the pattern confers protection against affection. The null hypothesis yields OR = 1, which is interpreted as the pattern being unassociated with the dependent variable. To linearize and balance the risk measure around the null hypothesis, it is common to convert to a logarithmic scale. Here, we use log10(OR) as the risk measure (log10 is convenient, but the natural logarithm is also commonly used), and other risk measures could be considered, such as positive likelihood ratio.


Table 1 Representation of the 2 × 2 contingency table that forms the basis of our approach
 CasesControlsRow totals

Have patternabNwith
Do not have patterncdNwithout
Column totalsNcasesNcontrolsNtotal

The probability p of obtaining a particular table a, b, c, d is given by:

Equation a

where a, b, c and d occupy the cells in the 2 × 2 table indicated in Table 1. Note that, when the row and column totals (margins) are fixed, the only degree of freedom in this expression is one of the interior values such as a. When the odds ratio for a particular table having a = a0, OR(a0), is > 1, the probability of obtaining a table with OR ≥ OR(a0) by chance is the P-value for the table having a = a0, and is given by:

Equation b

where p(a) is the probability for the observed table defined by a ≥ a0, and the sum is over all values of a ≥ a0 that keep the margins constant. A similar expression exists for OR < 1, and the sum is over a ≤ a0.

Consider an epidemiological case-control study represented as a case graph and a control graph. We first determine how many cases share the same values for a given set of features. This corresponds to the largest subgraph of the case graph in which all given s-nodes are fully connected to a set of t-nodes, and where neither the source nor terminal set can be enlarged without reducing the size of the other. Such a subgraph is an instance of a maximal clique. The source set of such a maximal clique is the pattern of independent variables (feature values), its terminal set is the support set, and the terminal set's cardinality is the value of a in the 2 × 2 table (Table 1). The cardinality of the terminal set from the maximal clique in the control graph having the same source set would determine the value of b (Table 1). Since Ncases and Ncontrols are known and fixed, the 2 × 2 table for the pattern would be determined by these two counts. If the pattern of interest were known in advance, it would be a simple matter to search the case and control graphs for the desired clique. The challenge, however, is to evaluate the risk associated with every pattern, composed of every combination of features available in the study. This corresponds to exhaustively searching the graphs for all maximal cliques, and evaluating each one for risk and statistical significance. An exhaustive search is possible but may become intractable as the number of features and values per feature increases. Thus, we have implemented an algorithm that incorporates user-defined constraints to limit the complexity of the search, but is exhaustive within those bounds.

Set operations for maximal cliques

We construct a bipartite graph, G = (S,T,E), where S and T are disjoint sets of nodes and E is a set of undirected edges, e = (s,t), where s-node s is in S and t-node t is in T. Figure 1 illustrates such a graph for disjoint sets {p} and {f}. (The assignment of sets to S and T is arbitrary. Our method consistently assigns the feature values to S and the people to T.) Our goal is to find all maximal cliques B = (SB, TB, EB) of G, where SB, TB, and EB are subsets of S, T, and E, respectively, and there is an edge e = (sB, tB) for all pairs of nodes in SB and TB. A clique B is said to be maximal if there is no other clique B’ = (S’B, T’B, E’B), where SB is a (proper) subset of S’B, or TB is a subset of T’B. The inner boxed portion of Figure 1 shows a maximal clique within a bipartite graph.

Our method operates on two types of candidates, which we refer to as s-cliques and t-cliques. For each s in S, we form an s-clique, C(s) = [{s}, T(s)] where T(s) is the set of all t such that there is an edge (s, t) in E. Similarly, for each t in T we have a t-clique C(t) = [S(t), {t}]. All the candidate cliques we identify can be described as generalizations of this form. Specifically, given any set, S, of sources, we have an s-clique C(S) = [S, T(S)] where T(S) is the set of t such that there exist edges (s, t) for all s in S. Similarly, we have t-cliques C(T) = [S(T), T]. The basic operation which is used to expand cliques is

C(S1 ∪ S2) = [(S1 ∪ S2), T(S1) ∩ T(S2)] for s-cliques,

and similarly,

C(T1 ∪ T2) = [S(T1) ∩ S(T2), (T1 ∪ T2)] for t-cliques.

We use s-cliques in discussing the algorithm further, but the same arguments can be applied to t-cliques.

The expansion operation used to identify maximal cliques is depicted in Figure 2A. Clique C(S1) is expanded using C(S2). The resulting clique contains the union of the source sets and the intersection of the terminal sets. In expanding s-cliques, S2 usually contains a single element, with one important exception. Given any source set, S1, and its associated T(S1), we can identify the set X(S1) of sources which can extend S1 without decreasing T(S1). We call this the extension set of S1. X(S1) is defined by X(S1) = {s| T(S1)⊂T(X(S1))}. This is equivalent to saying that (T(S1)∩T(X(S1)) = T(S1). Thus, T(S1) is not decreased by adding s to S1. The operation of adding the largest X(S1) to S1 forms a maximal clique. By definition, neither the extended S1 nor T(S1) can be increased without decreasing the other. We thus define the operation of adding X(S) to S as taking the closure of S. This situation is illustrated in Figure 2B. Clique C(S1) is extended using C(X(S1)). The resulting clique contains the union of the disjoint source sets and the intersection of the completely overlapping terminal sets, which is identical to the original terminal set. This condition defines the extension X(S1). For convenience, we refer to single-element cliques together with their extensions as singletons. As an example of this, consider the node f8 in Figure 1. It has the people p4, p6, and p7 associated with it. This is a clique in the general sense, but it is not a maximal clique, the type of clique we wish to consider in our analysis. In particular, p4, p6, and p7 are also associated with f3 and f6. Thus, once we choose to include f8 in a clique, we could include f3 and f6 as well, without losing any people. We therefore do not consider f8, p4, p6, and p7 to define a singleton, but instead immediately add f3 and f6 to the clique. We have in effect acquired f3 and f6 “for free.” The latter clique is the desired maximal clique. The former clique is not.

Figure 2 Figure 2

Implementation

In this section, we give a schematic diagram of the algorithm and a description of each of its major components. We trace the flow of information and control from one component of the algorithm to another, describing figures of merit, constraints, how we deal with missing data and, finally, the output the algorithm produces.

Components and flow

A schematic diagram of the program components and flow is shown in Figure 3. The program starts by building the bipartite graphs from the tables of raw data (typically, flat files). The raw case and control data, containing values for each of the features fi for each individual pi, are converted into bipartite graphs. An external mapping table is used to convert the raw data values into discrete, categorical feature values for use as s-nodes. Feature values not present in the raw data are mapped to a categorical value reserved for missing data. The t-nodes are derived from the raw record identifiers.

Figure 3 Figure 3

The program then proceeds to search for maximal cliques in the case graph. The control graph is simply searched for s-nodes (feature value sets) that match those discovered in the case graph to obtain the counts for the 2 × 2 table. Clique discovery in the controls is thus avoided. The search is primed by finding all singleton cliques, including extensions, (maximal by construction, see above) by inspection. A copy of the singleton list is kept for use in clique expansion (see below).

Each clique is assigned a value for its figure of merit (FOM) and checked against user-specified constraints. Typical FOM choices include P-value and OR, but can include other measures derived from the 2 × 2 table. Typical filter constraints include minimum or maximum number of s-nodes and t-nodes, and minimum or maximum values of FOM. Cliques that meet the constraints are sent to an output file (they are acceptable), and placed in the candidate queue for expansion. The candidate queue is prioritized by FOM; that is, the clique with the best FOM is the first to be selected for expansion. A user-specified maximum queue size is imposed, based on available system memory. When the limit is exceeded, candidates at the bottom of the queue (with the worst FOM) are discarded. To improve efficiency, every candidate in the queue has an associated data structure in which the list of those singletons that have at least one t-node in common with its own t-nodes is maintained. We refer to this list as the neighbor set.

The program searches for new acceptable candidates by removing the candidate from the top of the queue for expansion. The selected clique is merged with each of the cliques in its neighbor set and extended if possible (see the section “Set operations for maximal cliques”). If the new candidate meets the external constraints and is not a duplicate of an existing clique, it is sent to the output file and inserted into the prioritized candidate queue on completion of the current expansion cycle. By first extending each newly formed candidate and then checking a hash table for duplicates, we avoid the quadratic process of having to compare every new clique to every existing one in order to determine maximality. The performance of the algorithm is linear in the product of the number of s-nodes, the number of t-nodes, and the number of cliques. After the queue has been augmented with all the acceptable new candidates, the new top element is removed for expansion, and the cycle is repeated until the queue is empty. The output file contains the maximal cliques that could be built from the cliques in the queue which met the constraints, along with the 2 × 2 table and statistics for each clique.

Figure of merit and constraints

With an infinite queue size and no constraints, the algorithm finds all acceptable patterns. For small problems this may be practical, but for problems where the number of possible patterns exceeds the queue size, some patterns will never be expanded, and it is possible that a complex pattern of interest may never be built because none of its precursors are still on the queue and available for expansion. In practice, this can be avoided by choosing a FOM that is expected to be high for immediate precursors of interesting patterns. This choice depends on the model system under study, but the algorithm does not presuppose any particular model.

The ability to externalize and tailor the queue-ordering function to the presumed shape of the landscape (shape of the FOM function in a multidimensional space) is actually a strength of the method that could be exploited in some situations. For our simulation we used the P-value as the basis for prioritizing the queue. Statistically, this is a “neutral” measure, in the sense that it measures the confidence in the result, not the strength of the result.

Although it is possible to leave a set of interacting features “stranded” on the pattern landscape, that would require that all paths from singletons to the pattern in question be discarded. For short patterns, we would have to discard only a few, even shorter, patterns. We handle these shorter patterns at the start of the process when the queue is relatively empty; therefore, we are unlikely to discard them for lack of space. For long patterns, the number of ways the pattern could be built up grows factorially; therefore, it is unlikely that we would discard all the paths. (For 10 genes there are a maximum of about 1 million patterns, which would require about 1 gigabyte of memory. In a real study, the number of actual patterns with any reasonable support is much less than the maximum; thus, much less storage is required in practice.)

Unlike the FOM, which is used to prioritize cliques for expansion, constraints are used to directly filter the patterns that are put into the queue and reported as output. This prevents wasting queue space and interpretation effort on patterns that the user decides in advance would not be of interest. These constraints apply even when the queue is not full. We typically apply constraints to s-node and t-node counts, odds ratio, and P-value.

Final candidates and the lattice of cliques

When the algorithm halts, the output contains all the maximal cliques that could be built from the cliques in the queue that met the constraints. We call these the final candidates. Each final candidate is reported with its s-set (features), t-set (individuals), 2 × 2 table with any accompanying statistics, and FOM. The goal is to decide which patterns have the feature set that best predicts whether an individual is a case.

A few caveats are in order. First, the features that predict a disease or a drug response do not necessarily cause the disease or the response. Second, the results obtained from the sample of the population making up the data set may not be valid for the population as a whole, or for all segments of the population.

When we speak of a pattern's feature set, and the number of individuals who do and do not exhibit the pattern, we must be clear about what we are counting. Consider a data set with binary features A, B, and C. Every individual is either “1” or “0” for each of the features, for a total of 23 = 8 possible unique records. But there are 33 = 27 possible patterns. That is because the feature set for a pattern is the set of s-nodes that directly participates in the maximal clique, and this set implies “any value” for all features not explicitly mentioned in the set. For example, the set {A1, B1} specifically excludes A0, B0, but implicitly includes either C0 or C1. This can be written as {A1, B1, C*}, and has a t-set containing individuals with A = 1, B = 1, and C = any. Thus, each feature has three possible values: “1”, “0”, and “*”; hence, 33 = 27 possible patterns.

The preceding example leads to an inherent ordering of patterns. Given two maximal cliques, C1(S1,T1) and C2(S2,T2), S1 ⊂ S2 if-and-only-if T1 ⊃ T2, and S1 ⊃ S2 if-and-only-if T1 ⊂ T2. This pair of properties allows us to construct a lattice of patterns from the algorithm output. Recall from the earlier section “Concepts and definitions,” that a lattice is a collection of objects (cliques, in this case) and a partial ordering. The cliques themselves become nodes in the lattice graph, and edges exist between nodes corresponding to pairs of cliques for which the partial ordering relation (in this case, the subset relation) holds.

Sometimes not every node in the lattice is represented because the pattern list may be incomplete due to queue limitations or applied constraints, or both. In addition, the lattice may be filtered to eliminate uninteresting or statistically nonsignificant patterns. The resulting structure is a filtered lattice which is a collection of subgraphs of the full lattice. We refer to this structure simply as the lattice. A simple example of a lattice whose elements correspond to maximal cliques is shown in Figure 4. Feature sets (labeled in uppercase) become more generalized moving down and more specialized moving up. Support (labeled in lowercase) becomes broader moving down and narrower moving up. Connected nodes satisfy the partial ordering requirements for a lattice. For example, we could say that the clique [BCDF;cd] is “less than” [BCD;acd], if we use the subset relation between feature sets as the “less-than” relation (BCDF ⊂ BCD*)

Figure 4 Figure 4

The four highlighted nodes illustrate the effect of progressively specifying features. With only feature C fixed (yellow), four individuals are found with the pattern. When feature A is incrementally added (orange), one individual (d) is lost. Further specifying feature E (red) loses one individual (a). If B and D are specified (green), however, two are lost (b,c).

Notice that some of the intermediate nodes have been filtered out, in which case edges are simply inserted to bypass the “ghost” nodes. The arrow in Figure 4 points to the location where patterns with features ABC and ACD would have been. Adding either B or D singly to AC may lead to ghost nodes for several reasons. They could have been intentionally filtered as uninteresting or nonsignificant, in which case their effect must have been for each of them to have removed one individual (b or c) apiece from the [AC;abc] pattern. Otherwise, either B or D removed both b and c, while the other had no effect. If adding B had removed both b and c, then the resulting pattern [ABC;a] would not have been a maximal pattern, since [ABCD;a] has the same support. If adding D had had no effect, then AC would not have been maximal, and [ACD;abc] would have replaced [AC;abc] in the lattice.

If the lattice is augmented with the s-set, t-set, and risk statistics, it can be a powerful aid in reasoning about the relationships among patterns. The algorithm expands cliques in order to search for combinations of feature values that together confer risk, but that individually, or in subsets, may not. Application of parsimony concepts is deferred until after all the cliques are discovered. At that point, one can use the lattice to trade off features for support; that is, a more general (smaller) description of the feature set covers more people, but often at the expense of conferred risk (odds ratio). The algorithm described here does not attempt to optimize such a trade-off.

Missing data, correlated features, and quantitative traits

A final consideration for the practical application of this method is how missing data are handled. Options include imputing missing data by statistical inference or omitting entire records containing any missing data, but neither are optimal solutions; whereas data imputation is effectively used in family-based studies where Mendelian or similar models can be used to predict unknown genotypes, this approach may be prone to misclassification in case-control or cohort studies of unrelated individuals. Similarly, omission of entire subjects limits the power for analysis of other variables. Statistical inference requires that the features be either uncorrelated or the correlations known, but it is those very correlations that we are trying to detect. Omitting records is wasteful of data that are often hard to collect, and even if the data were plentiful, the missing data may not be randomly distributed among records, thus introducing bias.

We propose to make use of the available data wherever possible. We exclude from all t-sets any individual that does not have one of the allowed values for every feature in the s-set. In other words, a missing feature value is never a match to any feature value. But a missing value for a feature not in the s-set does not in and of itself exclude an individual from t-set membership. This rule is most noticeable when counting the number of individuals that do not have a particular pattern of feature values. To be counted, they must have some value for every feature in the pattern, and at least one of the values must be different from all members of the pattern feature set. As a result of this treatment, Ncases and Ncontrols are not constant across all patterns. When the pattern-to-pattern margin fluctuations are large, the P-values for patterns with identical OR can vary noticeably.

Single nucleotide polymorphisms (SNPs) on the same chromosome tend to be correlated to a degree more or less proportional to their proximity, a phenomenon called linkage disequilibrium (LD). LD is especially strong for SNPs in the same gene. When LD is present in the SNPs being studied, if either SNP is found to be a member of a clique, the other SNP will also tend to be a member of the same clique. If the LD is known in advance, as could be assumed for SNPs on the same gene, then one might collapse them into a single variable without significantly affecting the predictive power of the clique. However, in drawing inferences from the clique SNPs about their possible effects on pathway kinetics and disease processes, one must keep in mind that even if two SNPs are correlated, they both, individually or together, can still contribute to the pathology.

Quantitative variables, as might be used to study gene-environment interactions, could be included alongside SNP and other categorical variables by binning (placing in bins). Such analyses are not presented here, but the binning operation is conveniently performed during the feature-value mapping operation as shown in Figure 3.

Validation

To demonstrate the utility and validity of our method, we applied it to a selection of synthetic data sets. The data sets were devised to emulate the situations encountered in epidemiology studies involving common diseases having suspected associations with multiple factors that could include inherited genotypes, somatic genotypes, demographic characteristics, or exposures. For example, the features may consist of data on SNPs, and the dependent variable may be a particular type of cancer.

Data-set construction

For this application, we undertook a simple example that considered the potential effect of nine genes as independent variables and a binary disease-dependent variable, case-control status. At each gene, we specify a SNP of interest with two alleles. We designate the common, or major allele A and the rare, or minor allele a. Because somatic cells (cells other than egg and sperm) have paired (diploid) chromosomes with one allele from each parent for each SNP, three possible allele combinations, or genotypes, arise: AA, Aa, and aa. If the frequency of A is p, then the frequency of a is 1− p. Under ordinary circumstances, these two alleles combine in binomial proportions (i.e., Hardy-Weinberg equilibrium, or HWE, for short) to form three genotypes, such that

freq(AA) = p2
freq(aa) = q2
freq(Aa) = 2pq

The minor allele frequencies were supplied as parameters, and the three genotype frequencies were calculated for each gene assuming HWE, resulting in an assortment of genotype frequencies across the nine genes, assuming biallelic polymorphisms. Each individual was randomly assigned a genotype for each gene, with probability freq(genotype). A genotype pattern was then specified as having an enhanced risk. Each individual was then checked for a genotype match to the specified risk pattern and labeled a case or control by using penetrance parameters R1 = prob(case|match) and R0 = prob(case|not match) respectively, until a given number of cases and controls were generated.

As an illustration, consider a set of genes in which the minor allele frequencies are all 10 percent. HWE then gives freq(AA) = 0.9*0.9 = 0.81, freq(Aa) = 2*0.9*0.1 = 0.18, and freq(aa) = 0.1*0.1 = 0.01 for each gene. Using these frequencies, we assign genotypes for all genes to a population such that the genotype frequencies in the population approximate HWE. For example, each person is assigned G1 = AA with probability 81 percent, G1 = Aa with probability 18 percent, and G1 = aa with probability 1 percent. In this example, a population of 1000 people would thus have about 810 people with genotype AA, 180 people with genotype Aa, and 10 people with genotype aa, for each gene. We then specify a (multi-)gene pattern as conferring an increased risk of disease. For example, people with G1 = AA and G2 = Aa could have a 20 percent risk, that is, p(case|match) = 0.2, whereas everyone else, including people that match only one of the two genotypes in the pattern, has a 10 percent risk, that is, p(case|not match) = 0.1. Note that there is no increased risk for a partial match. We assign people to the cases with 20 percent probability if they have G1 = AA and G2 = Aa, and with only 10 percent probability otherwise. In this example, 0.81*0.18 = 0.1458, or about 146 of the 1000 people would match, and of those, about 20 percent, or 29 people, would be cases, and 80 percent, or 117 people, would be controls. Of the 854 people who did not match, about 10 percent, or 85 people, would be cases, and 90 percent, or 769 people, would be controls. The 2 × 2 table for such a pattern is shown in Table 2.


Table 2 The 2 × 2 contingency table for one of the data sets used in the validation runs
G1 = AA and G2 = AaCasesControlsRow totals

Match29117146
Do not match85769854
Column totals1148861000
OR = 2.24, P = 0.000829, FOM = 3.08

Results and analysis

We used a variety of parameter values to construct a series of situations for application of our pattern discovery method. Populations were generated that had pattern complexity, allele or genotype frequencies, and risk (i.e., odds ratio) effects that are typical of those seen in molecular epidemiology studies. Among the properties explored were the effects of the complexity of the doped pattern (intentionally risk-enhanced pattern) and its overall frequency in the population (Nwith / Ntotal in Table 1) on the ability of the algorithm to reliably detect the pattern. All populations were generated with Ncases = 500 and Ncontrols = 1000. The controls used were a random sample of the much larger number of controls generated. Pattern complexity was varied to include from one to five genes, and estimated pattern frequency, computed as the product of the individual genotype frequencies, ranged from 0.1 percent to 81 percent. The risks R0 and R1 were kept constant at 10 percent and 20 percent, respectively, resulting in an average odds ratio of 2.3. We used statistical significance, expressed as −Log10(P-value), as a FOM for prioritizing the queue and ranking the results. The only constraint was that the minimum fraction of cases with the pattern of interest (support) was set to 5 percent for all the runs. For some of the rarer patterns, additional runs were made using 1 percent support.

Table 3 summarizes the results of some representative runs. The results of each run show the observed overall frequency, odds ratio, FOM = −Log10(P-value), and rank of the discovered pattern out of the total patterns reported. A “−” indicates “pattern not discovered.” At 5 percent support, the method reliably picks up patterns down to 5 percent frequency, and detects even rarer patterns. At 1 percent support, there are roughly 10 times more patterns reported, and reliable detection drops off at around 3 percent frequency. A larger population would be needed for reliable results at the 1 percent level. The combined run mixes two populations and resolves their components.


Table 3 Results of representative validation runs
Run  N Genes Component Genotype Freqs. (%)Estimated Freq. (%)Observed Freq. (%)Odds RatioFOM Rank Total

A. Single-doped patterns run at minimum support of 5 percent
 
118181.082.72.57.811839
 
2281, 7359.362.62.413.411953
 
314949.051.52.718.111236
 
4381, 73, 6639.042.62.211.511837
 
5481, 73, 66, 5922.929.22.08.811945
 
612525.025.32.715.011034
 
7249, 4622.524.32.714.511260
 
8581, 73, 66, 59, 5312.014.52.37.411833
 
9237, 3513.013.92.05.211188
 
10349, 46, 439.711.32.57.511211
 
11225, 256.38.52.14.11983
 
1215.45.47.22.03.211753
 
13337, 35, 324.25.63.26.411131
 
1413.63.64.12.74.011736
 
1517.77.79.32.46.121776
 
16149, 37, 275.06.51.92.741134
 
17452, 46, 40, 353.34.02.22.791683
 
18449, 46, 43, 403.94.32.22.9151144
 
1912.12.11845
 
20325, 25, 251.61017
 
21549, 46, 43, 40, 371.41158
 
22111.01796
 
23425, 25, 25, 250.41003
 
B. Single-doped patterns run at minimum support of 1 percent
 
2412525.025.32.715.0111734
 
2513.63.64.12.74.0110601
 
26225, 256.38.52.14.1211685
 
27149, 37, 275.06.51.92.73812138
 
2812.12.11.92.51.921610199
 
29325, 25, 251.62.21.91.3105911651
 
30111.01.22.01.0232210005
 
31425, 25, 25, 250.411633
 
32525, 25, 25, 25, 250.111614
 
C. Double-doped patterns run at minimum support of 5 percent
 
33a237, 3513.014.21.53.481145
 
33b349, 46, 439.710.41.63.761145

Figure 5 summarizes the coverage, detection, and #1 ranking for the runs in Table 3. Runs marked with a triangle detected the test pattern. Runs marked with asterisks ranked the test pattern #1. Of the 23 runs with 5 percent support, the test pattern was detected in 18, ranked #1 in 14, and in the top 10 in 17. Of the five patterns not detected at 5 percent support, four were rerun with 1 percent support, and three were detected, but not with high ranks due to insufficient sample size for such rare patterns. Nevertheless, for frequencies in the range typically encountered in clinical studies of common diseases, the pattern of interest is clearly visible.

Figure 5 Figure 5

The combination run demonstrates the utility of the method for resolving multiple etiologies in heterogeneous populations. The populations of runs 9 and 10, with two and three genes respectively, were mixed by combining the original two case files and two control files into a single pair with 1000 cases and 2000 controls. The algorithm was run with 5 percent minimum support. The two components were found at ranks #6 and #8. Note that the odds ratio and FOM are degraded from the separate component values because in the mixed population, the individuals from population #9 that happen to have pattern 10 do not have the enhanced risk of the individuals from population 10 with the same pattern, and vice versa. As a result, the odds ratios of both components are reduced to values consistent with a risk of 15 percent for each pattern in the combined population, compared to 20 percent in the separate populations. (We can average R0 = 10% and R1 = 20% because the proportions of both cases and controls in the mixture are equal.) To test this prediction we generated the two original populations 10 times each, with R1 = 15% instead of 20% (data not shown). The observed odds ratio for the components in the mix was within 1 standard deviation of the mean odds ratio for the separate components with R1 = 15%. Given this worst-case mix scenario, the observed ranks (#6 and #8) are an indication of the strength of the overall method.

The lattice structure described previously is used to inspect the neighborhoods of the two components of the mixed pattern. Figure 6 shows the relationships between the designated patterns 9 and 10 having ranks #8 and #6, respectively, and other patterns that ranked better. Rank #7 is a 3-gene specialization of the 2-gene component with rank #8. Ranks #4 and #2 are 4-gene specializations of the 3-gene component with rank #6, and ranks #3 and #1 overlap with two of the 3-gene component's features. Overall, all but one (#5) of the top 8 patterns differ from the designated patterns by either a single addition or a single substitution. The protective pattern with rank #5 has no features in common with any of the other patterns. This is not surprising, because one might expect protective patterns and risky patterns to have little in common.

Figure 6 Figure 6

The rankings shown in Table 3 are based on P-value, and thus reflect an estimate of the type-1 error associated with statistics derived from the 2 × 2 table. For real data, the 2 × 2 table itself would have estimated errors for all four cells based on the genotyping error rate and the missing data rate. Quantitative estimates of these errors are beyond the scope of this paper. One could, however, explore the distribution of FOM, and in particular, the distribution of best FOM, for an ensemble of data sets with randomized affection status. This was not done for our synthetic examples because we could compare the list of cliques directly with the known true patterns.

Summary and future directions

We have described a graph-theoretical approach to searching for patterns in categorical data and demonstrated its ability to reliably detect and identify patterns that confer risk in situations typical of molecular epidemiological case-control studies. In particular, our method performs well on multivariate patterns that often present a challenge for traditional methods. The technique can be used to sort out multiple patterns conferring independent risks, indicating its potential for resolving multiple etiologies. We also showed how the use of the lattice concept can be helpful in understanding the relationships among discovered patterns.

An effort to expand the scope of our approach and enhance the features and performance of the implementation is under way. Until now we have focused on the effects one clique (or family of cliques) at a time. The individual elements of a clique are implicitly connected with the logical AND operator. When considering more than one clique from separate families, the combination is aptly described using the logical OR operator between the cliques. Hence, a full description of a combination of cliques would involve a mixed Boolean expression. We are currently working on an extension to our method that automatically detects such combinations and simplifies their Boolean descriptions.

Another area for future work involves more extensive analysis of the overlaps in source sets and terminal sets within the lattice. The strict definition of clique can be relaxed, allowing sets with less than complete connectivity to participate in patterns, perhaps more closely approximating the situations encountered in real studies. We are exploring additional enhancements to the algorithm, including more complex FOM and constraints. We are currently engaged in analyzing data from a real epidemiological study involving genetic and environmental risk factors for breast and endometrial cancer.

Cited references

Accepted for publication August 22, 2006; Published online December 31, 2006.


    About IBMPrivacyContact