CHAID

CHAID, for Chi-square Automatic Interaction Detector (or Detection, depending upon the source consulted), is an exploratory method used to study the relationship between a dependent variable and a series of predictor variables. CHAID modeling selects a set of predictors and their interactions that optimally predict the dependent measure. The resulting model is a classification tree (or data partitioning tree) that shows how major "types" formed from the independent (predictor or splitter) variables differentially predict a criterion or dependent variable. The Measurement Group uses a second-generation CHAID algorithm called Exhaustive CHAID, as implemented in the SPSS Answer Tree program (Versions 2.1 and 3.1).


Overview: CHAID modeling is an exploratory data analysis method used to study the relationships between a dependent measure and a large series of possible predictor variables that may themselves interact. The dependent measure may be qualitative (nominal or ordinal) or quantitative. For a qualitative dependent measure, a series of chi-square analyses is conducted between the dependent and predictor variables. For a quantitative dependent measure, analysis of variance methods are used, with the intervals (splits) of the independent variables chosen so as to explain as much of the variance in the dependent measure as possible.
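
As a rough illustration of the quantitative case, the sketch below searches for the cut point on one ordered predictor that gives the largest one-way analysis of variance F ratio. It is a simplified stand-in and not the Answer Tree procedure: it considers only two-way splits, it computes no significance level, and the variable names and scores (age_band, qol) are hypothetical.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F ratio for a list of groups of numeric scores.

    Assumes at least two groups and some within-group variation
    (otherwise the ratio is undefined).
    """
    scores = [y for g in groups for y in g]
    grand = mean(scores)
    k, n = len(groups), len(scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def best_cut(x, y):
    """Try every cut point on the ordered predictor x.

    Returns (cut, F) for the two-way split "x <= cut" versus "x > cut"
    with the largest F ratio.
    """
    best = None
    for cut in sorted(set(x))[:-1]:
        low = [yi for xi, yi in zip(x, y) if xi <= cut]
        high = [yi for xi, yi in zip(x, y) if xi > cut]
        f = f_statistic([low, high])
        if best is None or f > best[1]:
            best = (cut, f)
    return best


# Hypothetical data: an ordinal predictor and a quality of life score.
age_band = [1, 1, 2, 2, 3, 3, 4, 4]
qol = [52, 48, 50, 54, 70, 74, 72, 68]
cut, f = best_cut(age_band, qol)
print(cut, round(f, 1))
```

With these made-up scores the strongest split separates bands 1 and 2 from bands 3 and 4. A full CHAID analysis would also consider splits into more than two groups and would test each candidate split for statistical significance.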

Reading a CHAID Diagram: CHAID diagrams should be thought of as a "tree trunk" with progressive splits into smaller and smaller "branches." The initial "tree trunk" is all of the participants in the study. A series of "predictor" variables is assessed to see whether splitting the sample on any of them leads to a statistically significant discrimination in the dependent measure. For instance, if our dependent measure is quality of life and our potential predictor variables (or splitting variables) are client characteristics, we would first assess whether there are different levels of quality of life for two or more groups formed on the basis of one of the predictor variables. The "most significant" of these would define the first split of the sample, or the first branching of the tree. Then, for each of the new groups formed, we would ask whether the subgroup can be further split on another predictor variable so that there are significant differences in the dependent variable, and we repeat the question for every subgroup created in turn. At each step, statistical tests are made to determine whether a significant split can be made (correcting very conservatively for the fact that we are examining many possible ways of splitting the data at one time). The result at the end of the tree-building process is a series of groups that are maximally different from one another on the dependent variable. In the example, the ultimate result would be a series of groups, defined by one or more of the predictor variables, that differ from one another in overall quality of life. Note that the tree can be pictured in a "top-to-bottom," "left-to-right," or "right-to-left" orientation and that the results are identical. Different orientations of the same tree are sometimes useful to highlight different portions of the results.
[Also note that on the various CHAID diagrams in our work, certain boxes have thicker lines around them. This is a "quirk" of the Answer Tree program, which requires that at least one box be highlighted in each diagram. In our work we usually highlight the top (base) node of the tree, but at other times we might highlight the final or terminal nodes, or certain nodes of special interest given the topic addressed or hypothesis tested.]
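
The tree-building process described above can be sketched in a few lines of Python for a categorical dependent measure. This is a minimal illustration under stated assumptions, not the Exhaustive CHAID algorithm: it does not merge predictor categories, applies no Bonferroni correction, and enforces only a minimum parent node size. The function names, the field names (housing, gender, qol), and the data are all hypothetical.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, standard library only."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:  # odd df: start from the closed form for df = 1
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:       # even df: start from the closed form for df = 2
        q, a = math.exp(-half), 1.0
    while a < df / 2.0:  # Q(a+1, x) = Q(a, x) + x^a e^-x / Gamma(a+1)
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1.0
    return min(q, 1.0)


def split_p_value(rows, predictor, target):
    """Pearson chi-square p-value for the predictor-by-target table."""
    table = Counter((r[predictor], r[target]) for r in rows)
    row_tot = Counter(r[predictor] for r in rows)
    col_tot = Counter(r[target] for r in rows)
    if len(row_tot) < 2 or len(col_tot) < 2:
        return 1.0  # nothing to test in this node
    n = len(rows)
    stat = sum((table[(a, b)] - row_tot[a] * col_tot[b] / n) ** 2
               / (row_tot[a] * col_tot[b] / n)
               for a in row_tot for b in col_tot)
    return chi2_sf(stat, (len(row_tot) - 1) * (len(col_tot) - 1))


def grow(rows, predictors, target, alpha=0.05, min_parent=10):
    """Split on the most significant predictor, then recurse into each group."""
    if len(rows) < min_parent or not predictors:
        return {"n": len(rows)}
    p, best = min((split_p_value(rows, v, target), v) for v in predictors)
    if p >= alpha:
        return {"n": len(rows)}  # no significant split: terminal node
    rest = [v for v in predictors if v != best]
    children = {}
    for level in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == level]
        children[level] = grow(subset, rest, target, alpha, min_parent)
    return {"n": len(rows), "split_on": best, "p": p, "children": children}


# Hypothetical sample of 40 clients.
counts = [("stable", "F", "high", 9), ("stable", "F", "low", 1),
          ("stable", "M", "high", 9), ("stable", "M", "low", 1),
          ("unstable", "F", "high", 1), ("unstable", "F", "low", 9),
          ("unstable", "M", "high", 1), ("unstable", "M", "low", 9)]
rows = []
for housing, gender, qol, count in counts:
    rows += [{"housing": housing, "gender": gender, "qol": qol}] * count

tree = grow(rows, ["housing", "gender"], "qol")
print(tree["split_on"], sorted(tree["children"]))
```

In this made-up sample the base node splits on housing, and neither resulting group can be split further on gender, so the tree ends with two terminal nodes of 20 clients each.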



[This tree is shown a second time below with the orientation changed, to give a slightly different way of looking at the same result. The model shows the total health-related quality of life score for a group of HIV/AIDS patients at the time of their enrollment into innovative service programs for their HIV disease.]

Advantages: The CHAID method has certain advantages as a way of looking for patterns in complicated datasets. First, the level of measurement for the dependent variable can be nominal (categorical), ordinal (ordered categories ranked from small to large), or interval (a "scale"). Second, the level of measurement for the predictor variables can likewise be nominal, ordinal, or interval. Third, not all predictor variables need be measured at the same level. Fourth, missing values in predictor variables can be treated as a "floating category" so that partial data can be used whenever possible within the tree. Fifth, if an appropriately conservative set of statistical criteria is used, the resulting models will primarily emphasize strong results without over-capitalizing on chance. On the other hand, it must always be remembered that CHAID modeling is essentially a "stepwise" statistical method and that there is always a potential for too much to be seen in the data, even when very conservative statistical criteria are used. Nonetheless, where there is no strong theory that clearly indicates which variables are, and are not, likely predictors of some dependent measure, CHAID will be very useful in identifying major data trends.

Known Issues/Problems with the Method: The program Answer Tree used here permits a Bonferroni-type probability to be used to correct for the number of different ways a single predictor variable can be split (see Biggs, deVille, and Suen, 1991). The program does not permit one to correct for the number of potential splitter (predictor) variables being considered. Also, Monte Carlo studies have not established the implications of mixing nominal, ordinal, and continuous indicators in the prediction of a nominal, ordinal, or continuous dependent variable. Monte Carlo studies have also not been extensively used to study the implications of different potential ways of handling missing observations. Additionally, CHAID is primarily a step-forward model-fitting method, and the known problems with step-forward regression models probably apply to this method of analysis as well. Finally, CHAID is a sequential fitting algorithm: its statistical tests are sequential, with later effects dependent upon earlier ones, and not simultaneous as in a regression model or analysis of variance, where all effects are fit at once.
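
To give a sense of why a correction for the number of possible splits matters, the sketch below counts the ways a predictor's categories can be grouped and applies a plain Bonferroni multiplication. The counting formulas are standard combinatorics; whether Answer Tree uses exactly these multipliers is an assumption not confirmed here, since the adjustment in Biggs, deVille, and Suen (1991) is not reproduced. All function names are hypothetical.

```python
from math import comb


def ordered_groupings(c, r):
    """Ways to merge c ordered categories into r groups of adjacent categories."""
    return comb(c - 1, r - 1)


def unordered_groupings(c, r):
    """Ways to merge c unordered categories into r non-empty groups.

    This is the Stirling number of the second kind, built up with the
    recurrence S(n, k) = k * S(n-1, k) + S(n-1, k-1).
    """
    s = [[0] * (r + 1) for _ in range(c + 1)]
    s[0][0] = 1
    for n in range(1, c + 1):
        for k in range(1, min(n, r) + 1):
            s[n][k] = k * s[n - 1][k] + s[n - 1][k - 1]
    return s[c][r]


def bonferroni(p, multiplier):
    """Bonferroni-adjusted probability, capped at 1."""
    return min(1.0, p * multiplier)


# A five-category predictor reduced to three groups.
print(ordered_groupings(5, 3))    # ordinal predictor
print(unordered_groupings(5, 3))  # nominal predictor
print(bonferroni(0.004, unordered_groupings(5, 3)))
```

An unadjusted probability of .004 for a nominal five-category predictor split three ways would no longer pass a .05 criterion once multiplied by the 25 possible groupings. An analyst who also wished to allow for the number of candidate predictors, which the program does not do, could multiply further by that number, at the cost of an even more conservative test.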

Programs: All analyses in this Knowledge Base use the Answer Tree computer program published by SPSS. Our analyses use the Exhaustive CHAID method, which is more computationally demanding but tends to produce more accurate results. In most cases, in addition to the analyses presented here, we have conducted alternate analyses using alternate methods, or alternate ways of setting up the same problem, to confirm that the general pattern of results presented does not depend upon the statistical analysis method.

Technical Options: Typically the technical options used for the analyses include the following: a Bonferroni-adjusted significance level of .05; a minimum parent node size of 10; a minimum child node size of 5; and the ability to split or combine continuously the categories of predictor variables. In some cases, these technical options are adjusted for the sample size or based on prior knowledge about the variables. When sample sizes are large or the variables are fairly "coarse" ones, the minimum parent node size is sometimes set at 20 and the minimum child node size at 10. Look for a hyperlink to Technical Options in various Knowledge Items to show the exact way the program was set up for the analyses presented in that Knowledge Item.
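
The node-size and significance settings listed above act together as a stopping rule. A minimal sketch of such a rule follows; the class and function names are hypothetical and do not correspond to any Answer Tree setting names, and the use of a strict "less than" comparison for the probability is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaidOptions:
    """Typical settings described in the text (names are illustrative)."""
    alpha: float = 0.05   # Bonferroni-adjusted significance level
    min_parent: int = 10  # smallest node that may be split
    min_child: int = 5    # smallest node a split may create


# Stricter sizes sometimes used for large samples or "coarse" variables.
LARGE_SAMPLE = ChaidOptions(min_parent=20, min_child=10)


def split_allowed(parent_n, child_sizes, adjusted_p, opts=ChaidOptions()):
    """Return True only if a candidate split satisfies every option."""
    return (parent_n >= opts.min_parent
            and all(n >= opts.min_child for n in child_sizes)
            and adjusted_p < opts.alpha)


print(split_allowed(12, [7, 5], 0.01))                # passes the typical settings
print(split_allowed(12, [8, 4], 0.01))                # one child node is too small
print(split_allowed(12, [7, 5], 0.01, LARGE_SAMPLE))  # parent node is too small
```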

Note on CHAID as a Modeling Mechanism: CHAID is a useful method of summarizing data, and can show major natural divisions of the clients by various defining variables. It must be recognized, however, that CHAID is analogous to a "forward" stepwise regression analysis and has all of the possible attendant difficulties of such stepwise regression. The models presented should be considered suggestive but not definitive, as there may be alternate models that also fit the data in a statistically or theoretically acceptable manner. In virtually all cases here, we have used fairly conservative statistical criteria, with Bonferroni-adjusted probabilities correcting for the number of ways each individual predictor variable can be split. As a result, we do not show every possible "significant" relationship, but instead focus on those that are "important" in a statistical sense and presumably more likely to be replicated in new samples.

Technical Citation: The analyses conducted here use the algorithm discussed by D. Biggs, B. deVille, and E. Suen (1991), A method of choosing multiway partitions for classification and decision trees, Journal of Applied Statistics, 18(1), 49-62. Biggs, deVille, and Suen show that their algorithm more correctly protects standard statistical testing assumptions than earlier CHAID and AID (Automatic Interaction Detection) algorithms. We thank SPSS for giving us access to their internal statistical algorithms so that we could fully understand the calculation steps made in this proprietary software.

[This is the same tree as that shown above with the orientation changed to show a slightly different way of looking at the same result. The model shows the total health-related quality of life score for a group of HIV/AIDS patients at the time of their enrollment into innovative service programs for their HIV disease.]


Also see: The implications of using alternate algorithms to develop classification trees for the types of data typically analyzed in studies by The Measurement Group.

Also see: General index of data mining applications (including those using CHAID) conducted by The Measurement Group.



 


© Copyright 1999-2005 by The Measurement Group LLC. All rights reserved.